
[LV] Improve inclusivity of vectorization
Needs Review · Public

Authored by lebedev.ri on Sep 5 2021, 12:13 PM.

Details

Summary

Right now, LoopVectorizer has a hard limit on the number of runtime memory checks.
The limit is currently 8, and while it generally works reasonably well,
it is, like all arbitrary limits, ultimately arbitrary.

There are several problems with it:

  1. It puts a hard cap on the complexity of the loops it will vectorize. Naturally, the more pointer arithmetic/"objects" you have, the more checks are needed
  2. The number of runtime memory checks doesn't actually correlate with the overhead incurred by them. I've checked locally, and a single check can have a cost from 4 to 25...
  3. Why do we have this hard limit anyways? I guess because we want to avoid generating too many checks?
  4. How did we come up with the current limit?

Therefore, I would like to propose completely changing the approach here,
and instead specifying the budget for said checks in terms of multiples of the cost
of a single iteration of the original scalar loop.

That is, if the cost of a single iteration of the original scalar loop is 10,
and the multiple is 2, then the budget for the runtime checks is 10*2 = 20.
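The proposed rule is easy to state in code. Below is a minimal sketch, not the patch's actual implementation; the helper name is made up, and the factor value of 2 is just the worked example above:

```cpp
#include <cassert>

// Hypothetical knob mirroring the patch's VectorizeMemoryCheckFactor
// cl::opt; the value 2 matches the worked example above.
constexpr unsigned VectorizeMemoryCheckFactor = 2;

// Allow vectorization with runtime checks only while their total cost
// stays within Factor * (cost of one scalar loop iteration).
inline bool runtimeChecksFitBudget(unsigned RTCheckCost,
                                   unsigned ScalarIterationCost) {
  return RTCheckCost <= VectorizeMemoryCheckFactor * ScalarIterationCost;
}
```

With a scalar iteration cost of 10 the budget is 20, so a check cost of 20 is accepted and 21 is rejected.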

So far I have looked for the optimal value of this threshold on RawSpeed and darktable,
and the results are interesting:
https://docs.google.com/spreadsheets/d/1b3VPU1tPYGq0AO3XH3kBv3zdpKMby8aJzFl2cLSZ5AQ/edit?usp=sharing
Just to preserve all the existing vectorizations, we'd need to allow the cost of the run-time checks
to be no greater than the cost of 6 iterations of the scalar loop.

I know pretty much all of the code there should vectorize, because I (re)wrote most of it.
Originally, it was just manually vectorized with SSE2, but I've added plain fallbacks.

This is motivated by the bug report https://bugs.llvm.org/show_bug.cgi?id=44662, which I filed
almost two years ago now. The code is inspired by/based on the code by @fhahn in D75981,
but unfortunately that patch is rather stuck, and the vectorization area of LLVM appears to be
a walled garden without many outside-of-the-club contributions, with the latter being busy,
so I don't have much hope here :S

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 8a0999ddb98c..f4495cba57f5 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -8186,6 +8186,12 @@ LoopVectorizationPlanner::plan(ElementCount UserVF, unsigned UserIC,
   // Check if it is profitable to vectorize with runtime checks.
   if (SelectedVF.Width.getKnownMinValue() > 1 &&
       Requirements.getNumRuntimePointerChecks()) {
+    errs() << "LV LAA num " << Requirements.getNumRuntimePointerChecks()
+           << " RTCost " << Checks.getCost(CM) << " ScalarLoopCost "
+           << SelectedVF.ScalarCost.getValue().getValue() << " fraction "
+           << (double)Checks.getCost(CM) /
+                  SelectedVF.ScalarCost.getValue().getValue()
+           << "\n";
     if (Checks.getCost(CM) >
         VectorizeMemoryCheckFactor * (*SelectedVF.ScalarCost.getValue())) {
       ORE->emit([&]() {

Diff Detail

Unit Tests: Failed

Time    Test
420 ms  x64 debian > Clang.Frontend::optimization-remark-options.c
Script: -- : 'RUN: at line 2'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -O1 -fvectorize -target x86_64-unknown-unknown -Rpass-analysis=loop-vectorize -emit-llvm -S /var/lib/buildkite-agent/builds/llvm-project/clang/test/Frontend/optimization-remark-options.c -o - 2>&1 | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/clang/test/Frontend/optimization-remark-options.c
810 ms  x64 debian > libomp.lock::omp_init_lock.c
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -I /var/lib/buildkite-agent/builds/llvm-project/openmp/runtime/test -I /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/runtime/src -L /var/lib/buildkite-agent/builds/llvm-project/build/lib -I /var/lib/buildkite-agent/builds/llvm-project/openmp/runtime/test/ompt /var/lib/buildkite-agent/builds/llvm-project/openmp/runtime/test/lock/omp_init_lock.c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/runtime/test/lock/Output/omp_init_lock.c.tmp -lm -latomic && /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/runtime/test/lock/Output/omp_init_lock.c.tmp
490 ms  x64 windows > Clang.Frontend::optimization-remark-options.c
Script: -- : 'RUN: at line 2'; c:\ws\w3\llvm-project\premerge-checks\build\bin\clang.exe -O1 -fvectorize -target x86_64-unknown-unknown -Rpass-analysis=loop-vectorize -emit-llvm -S C:\ws\w3\llvm-project\premerge-checks\clang\test\Frontend\optimization-remark-options.c -o - 2>&1 | c:\ws\w3\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w3\llvm-project\premerge-checks\clang\test\Frontend\optimization-remark-options.c

Event Timeline

lebedev.ri created this revision. Sep 5 2021, 12:13 PM
lebedev.ri requested review of this revision. Sep 5 2021, 12:13 PM
nikic added a comment. Sep 5 2021, 1:14 PM

Shouldn't this be taking the trip count into account somehow? Generating 7 times the scalar loop in run-time checks seems okay if you have thousands of iterations, but would be bad for a cold loop with a few iterations. After a quick look at D75981 that variant does look at the TC. From the patch description, it's not entirely clear to me why a fixed factor of the scalar loop cost is preferable.

(Though as a comment on D75981, it uses getSmallBestKnownTC, which may return getSmallConstantMaxTripCount, which I believe will commonly just be INT_MAX for loops that are finite but have unknown trip count, so it may be drastically overestimating the TC for loops without exact trip count or profile data.)

Thank you for taking a look!

Shouldn't this be taking the trip count into account somehow? Generating 7 times the scalar loop in run-time checks seems okay if you have thousands of iterations, but would be bad for a cold loop with a few iterations. After a quick look at D75981 that variant does look at the TC. From the patch description, it's not entirely clear to me why a fixed factor of the scalar loop cost is preferable.

(Though as a comment on D75981, it uses getSmallBestKnownTC, which may return getSmallConstantMaxTripCount, which I believe will commonly just be INT_MAX for loops that are finite but have unknown trip count, so it may be drastically overestimating the TC for loops without exact trip count or profile data.)

You answered your own question in your last paragraph.
I'm fundamentally interested in variable trip count loops,
where we don't know the actual constant backedge-taken count,
so we simply can't take backedge-taken count into account here.

I have figured out what was bothering me about counting checks - we are actually counting check *groups*,
where each group intentionally doesn't necessarily contain only a single pointer, so the current design
doesn't account for the min/max reduction of pointers within a group.
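To illustrate the point, here is a toy model of the grouping, not LLVM's actual RuntimeCheckingPtrGroup API: a group's bounds are the min/max over its member pointers, so the per-member reduction work is invisible to a metric that only counts groups.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-in for the address range accessed via one member pointer.
struct PtrRange {
  uint64_t Low;  // first byte accessed
  uint64_t High; // one past the last byte accessed
};

// A group covers [min of Lows, max of Highs) over all members; computing
// it is a min/max reduction whose cost grows with the member count.
PtrRange groupBounds(const std::vector<PtrRange> &Members) {
  PtrRange G{UINT64_MAX, 0};
  for (const PtrRange &M : Members) {
    G.Low = std::min(G.Low, M.Low);
    G.High = std::max(G.High, M.High);
  }
  return G;
}

// The emitted runtime check per pair of groups: do the ranges overlap?
bool mayConflict(const PtrRange &A, const PtrRange &B) {
  return A.Low < B.High && B.Low < A.High;
}
```

A group with 45 members and a group with 1 member both count as "one check", even though the former needs 45 min/max steps before the single overlap compare.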

Collected some more numbers (sheet updated) (+ rawtherapee, babl/gegl, vanilla llvm test-suite).
I think it showcases the problem quite well. We currently are okay with vectorizing
if that means emitting a check with checks=8,members=45,RTCost=146,
but not with checks=9,members=18,RTCost=36.
I think the latter check is obviously cheaper than the former one :)
Those are real-world occurrences, not synthetic numbers.
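The inversion is easy to demonstrate with those two real cases; in this sketch the group limit of 8 is the current default, while the cost budget of 100 is a hypothetical value, picked only to separate the two cases:

```cpp
#include <cassert>

// The two real-world cases quoted above, as (group count, total member
// count, modelled runtime-check cost).
struct Checks {
  unsigned NumGroups;
  unsigned NumMembers;
  unsigned RTCost;
};

// Current policy: accept while the number of check *groups* is <= 8.
bool countPolicy(const Checks &C) { return C.NumGroups <= 8; }

// Hypothetical cost-based policy with an arbitrary budget of 100.
bool costPolicy(const Checks &C) { return C.RTCost <= 100; }

constexpr Checks Expensive{8, 45, 146}; // accepted today, yet costly
constexpr Checks Cheap{9, 18, 36};      // rejected today, yet cheap
```

The count-based policy accepts the expensive case and rejects the cheap one; any cost-based budget between 36 and 146 does the opposite.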

One important epiphany occurred to me: whatever formula we come up with,
it should not contain VF or the cost of the vectorized loop body,
because we'll pick the threshold for vectorization using e.g. VF=8/AVX2,
but that will naturally result in a smaller budget for checks
when LV picks e.g. VF=4/SSE4.2 as the best strategy.

As it stands right now, I think we roughly have the following possible solutions:

  1. "It ain't broken, let's not fix it." I think it's pretty obviously broken.
  2. Instead of counting the number of groups, count the total number of members. We could do that, but it is not guaranteed to correlate with the final cost of the checks (see RuntimeCheckingPtrGroup::addPointer()); I don't think this is the right fix.
  3. Just hardcode the budget. Might be a better approach than what we do now. I think then it should be ~160.
  4. My original proposal. Specify the budget in terms of the number of scalar loop iterations. I think then it should be ~12.
  5. ??? Does anyone have a better idea?
Matt added a subscriber: Matt. Sep 6 2021, 5:42 PM
fhahn added a comment. Sep 7 2021, 8:48 AM

Unfortunately I do not have the bandwidth or energy to get D75981 unstuck, as there are a few competing requests and my original plan was to fix the case where we have most information to make the best decision first (trip count available) and have people look into heuristics for their cases. However, most of the required infrastructure changes already landed and I updated D75981 to just contain the remaining non-heuristic changes.

Regardless of which heuristics we choose, I think we should not generate runtime checks if we can prove that RT + vector loop is more expensive than the scalar loop. I put up D109368 to guard against that.

Instead of counting the number of groups, count the total number of members. We could do that, but it is not guaranteed to correlate with the final cost of the checks (see RuntimeCheckingPtrGroup::addPointer()); I don't think this is the right fix.

I don't think that's a good idea, as it does not consider the cost of other runtime checks (like those generated for SCEV predicates required for the memory checks). If we want to make better cost based decisions, we should definitely include all aspects of the runtime checks we generate.

Just hardcode the budget. Might be a better approach than what we do now. I think then it should be ~160.

As in a cost of 160 in terms of InstructionCost?

My original proposal. Specify the budget in terms of the number of scalar loop iterations. I think then it should be ~12.

Do you mean the cost of 12 scalar loop iterations?

Unfortunately I do not have the bandwidth or energy to get D75981 unstuck, as there are a few competing requests and my original plan was to fix the case where we have most information to make the best decision first (trip count available) and have people look into heuristics for their cases. However, most of the required infrastructure changes already landed and I updated D75981 to just contain the remaining non-heuristic changes.

Sure, I can understand that. Thank you!

Regardless of which heuristics we choose, I think we should not generate runtime checks if we can prove that RT + vector loop is more expensive than the scalar loop. I put up D109368 to guard against that.

You mean if we know the constant trip count, or there is guard info that specifies the maximal trip count (but *NOT* PGO data)?
Sounds reasonable to me, I think.

Instead of counting the number of groups, count the total number of members. We could do that, but it is not guaranteed to correlate with the final cost of the checks (see RuntimeCheckingPtrGroup::addPointer()); I don't think this is the right fix.

I don't think that's a good idea, as it does not consider the cost of other runtime checks (like those generated for SCEV predicates required for the memory checks). If we want to make better cost based decisions, we should definitely include all aspects of the runtime checks we generate.

Yep, I remarked as much; this approach doesn't seem reasonable.

Just hardcode the budget. Might be a better approach than what we do now. I think then it should be ~160.

As in a cost of 160 in terms of InstructionCost?

Yep.

My original proposal. Specify the budget in terms of the number of scalar loop iterations. I think then it should be ~12.

Do you mean the cost of 12 scalar loop iterations?

Yep.

bmahjour removed a subscriber: bmahjour. Sep 7 2021, 9:15 AM

Hi @Ayal, @dorit!
The road forward here isn't mapped, so I would *love* to hear some thoughts
on the general direction of this patch, and on the steps needed to move this forward.

There is a certain overlap with @ebrevnov's patches; while I'm obviously biased,
I would say those changes would look more natural building on top of this change.

My original proposal. Specify the budget in terms of the number of scalar loop iterations. I think then it should be ~12.

Do you mean the cost of 12 scalar loop iterations?

Yep.

In the description you mentioned that ~6 is needed to match the current threshold, so 12 would be very roughly double the threshold IIUC? With 12, do all expected loops get vectorized for RawSpeed and that's a motivation for 12? I assume all loops in RawSpeed have sufficiently large trip counts (like >100 or > 1000)?

I've been wondering if it might be worthwhile to give users an easier way to tell the compiler it should assume high trip counts. That might make our life easier for projects/files where this applies, and could also help with other loop optimizations which only become profitable for larger trip counts (IIRC we had several reports by users where this could be helpful for more than just vectorization).

Collected some more numbers (sheet updated) (+ rawtherapee, babl/geg/, vanilla llvm test-suite).
I think it showcases the problem quite well. We currently are okay with vectorizing
if that means emitting a check with checks=8,members=45,RTCost=146,
but not with checks=9,members=18,RTCost=36.

That's an interesting finding! Originally replacing the old threshold with a cost-of-all-checks one did not seem very appealing to me. But if it would be possible to come up with a reasonable translation of the old one (not based on the cases where the cost currently is very much overestimated but some middle-ground), it might be a viable first step. But then it probably would be easier to just transition once and deal with the fallout.

My original proposal. Specify the budget in terms of the number of scalar loop iterations. I think then it should be ~12.

Do you mean the cost of 12 scalar loop iterations?

Yep.

In the description you mentioned that ~6 is needed to match the current threshold, so 12 would be very roughly double the threshold IIUC?
With 12, do all expected loops get vectorized for RawSpeed and that's a motivation for 12?

See the spreadsheet linked in the description, RTCost/ScalarCostPerIter is the new threshold, NumChecks is the old "threshold".
With ~6, almost all loops in RawSpeed+darktable get vectorized (well, ignoring those LV doesn't get to / doesn't know how to vectorize).
But obviously this threshold is going to vary somewhat per codebase;
as you can see in the spreadsheet, a somewhat higher limit is beneficial for the other codebases I looked at.

I assume all loops in RawSpeed have sufficiently large trip counts (like >100 or > 1000)?

It will obviously vary on a case-by-case basis, but yes; in the particular unvectorized loops that originally motivated this,
it's ~10M (see VC5Decompressor::Wavelet::reconstructPass()::process() and VC5Decompressor::Wavelet::combineLowHighPass()::process())


... for a single input (https://raw.pixls.us/data-unique/GoPro/HERO6%20Black/GOPR9172.GPR)
For darktable loops I'd say the trip count is not smaller than 1M, averaging maybe around ~25M+, or higher depending on lack of unrolling.

I've been wondering if it might be worth to give the users an easier way to tell the compiler it should assume high trip counts. That might make our life easier for projects/files where this applies and could also help with other loop optimizations which only become profitable for larger trip counts (IIRC we had several reports by users where this could be helpful for not only vectorization)

FWIW, I'm personally doing this because I'd like these things to just work without any pragmas.
Perhaps PGO counters could be useful for that.

Collected some more numbers (sheet updated) (+ rawtherapee, babl/geg/, vanilla llvm test-suite).
I think it showcases the problem quite well. We currently are okay with vectorizing
if that means emitting a check with checks=8,members=45,RTCost=146,
but not with checks=9,members=18,RTCost=36.

That's an interesting finding! Originally replacing the old threshold with a cost-of-all-checks one did not seem very appealing to me. But if it would be possible to come up with a reasonable translation of the old one (not based on the cases where the cost currently is very much overestimated but some middle-ground), it might be a viable first step. But then it probably would be easier to just transition once and deal with the fallout.

The obvious problem is, whichever new limit we choose, as long as it no longer vectorizes *some* cases,
some of the cases we no longer vectorize were actually profitable to vectorize (i.e. the run-time check passing at runtime).

Hi @Ayal, @dorit!
The road forward here isn't mapped, so i would *love* to hear some thoughts
on the general direction of this patch, and on the steps needed to move this forward.

There is a certain overlap with @ebrevnov's patches; while i'm obviously biased,
i would say those changes would look more natural building ontop of this change.

I believe these changes are quite independent. There are potential code conflicts (a purely technical thing) and maybe some cases where one affects the other, but logically they are mostly orthogonal. Today the cost model underestimates the cost of the vector version because it doesn't account for the cost of RT checks and the epilog loop. My patches try to fix that; it should help stop doing "definitely" unprofitable vectorization. This patch (and the original version of D75981) introduces/replaces a heuristic which tries to limit the potential overhead (btw, should we take probability info into account as well?) when/if the vector version is skipped due to failed runtime checks. At least this is how I see it.

My original proposal. Specify the budget in terms of the number of scalar loop iterations. I think then it should be ~12.

Do you mean the cost of 12 scalar loop iterations?

Yep.

In the description you mentioned that ~6 is needed to match the current threshold, so 12 would be very roughly double the threshold IIUC?
With 12, do all expected loops get vectorized for RawSpeed and that's a motivation for 12?

See the spreadsheet linked in the description, RTCost/ScalarCostPerIter is the new threshold, NumChecks is the old "threshold".
With ~6 almost all loops in RawSpeed+darktable get vectorized (well, ignoring those LV doesn't get to/know how to vectorize)
But obviously this threshold is going to vary somewhat per codebase,
as you can see in the spreadsheet, a somewhat higher limit is beneficial for other codebases i looked at.

I assume all loops in RawSpeed have sufficiently large trip counts (like >100 or > 1000)?

It will obviously depend on case-by-case basis, but yes; in the particular unvectorized loops that originally motivated
it's ~10M: (see VC5Decompressor::Wavelet::reconstructPass()::process() and VC5Decompressor::Wavelet::combineLowHighPass()::process())


... for a single input (https://raw.pixls.us/data-unique/GoPro/HERO6%20Black/GOPR9172.GPR)
For darktable loops i'd say the trip count is not smaller than 1M, averaging maybe around ~25M+, or higher depending on lack of unrolling.

I've been wondering if it might be worth to give the users an easier way to tell the compiler it should assume high trip counts. That might make our life easier for projects/files where this applies and could also help with other loop optimizations which only become profitable for larger trip counts (IIRC we had several reports by users where this could be helpful for not only vectorization)

FWIW i'm personally doing this because i'd like these things to just work without any pragmas.
Perhaps PGO counters could be useful for that.

I wasn't really thinking of a pragma (although it might be helpful in some cases), but a new compiler option (like `-fhigh-trip-count-assumption`).
PGO should definitely be helpful here.

Collected some more numbers (sheet updated) (+ rawtherapee, babl/geg/, vanilla llvm test-suite).
I think it showcases the problem quite well. We currently are okay with vectorizing
if that means emitting a check with checks=8,members=45,RTCost=146,
but not with checks=9,members=18,RTCost=36.

That's an interesting finding! Originally replacing the old threshold with a cost-of-all-checks one did not seem very appealing to me. But if it would be possible to come up with a reasonable translation of the old one (not based on the cases where the cost currently is very much overestimated but some middle-ground), it might be a viable first step. But then it probably would be easier to just transition once and deal with the fallout.

The obvious problem is, whichever new limit we choose, as long as it no longer vectorizes *some* cases,
some of the cases we no longer vectorize were actually profitable to vectorize (i.e. the run-time check passing at runtime).

There's another option that potentially allows us to actually side-step the issue of not knowing the trip count. Based on the formula used in the original version of D109368, we can compute the minimum trip-count required for the vector loop to be profitable. We can also compute a minimum trip count so that the cost of the runtime-check is only a fraction of the total scalar loop cost. We already emit a minimum iteration check which can be adjusted with the additional computed minimums. I think that would allow us to vectorize a lot more aggressively, while still guarding against runtime checks adding a large overhead if they fail for low trip count loops. I updated D109368 accordingly.
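A sketch of that guard, as described (illustrative names and a made-up cost-fraction denominator, not D109368's actual code): require the runtime-check cost to be at most 1/Denom of the scalar loop's total cost, and solve for the smallest trip count that satisfies it.

```cpp
#include <cassert>

// Require: RTCost <= TC * ScalarIterCost / Denom
// =>       TC    >= Denom * RTCost / ScalarIterCost   (rounded up)
// The result can be folded into the minimum-iterations check that
// already guards entry to the vector loop.
unsigned minTripCountForChecks(unsigned RTCost, unsigned ScalarIterCost,
                               unsigned Denom) {
  unsigned Needed = Denom * RTCost;
  return (Needed + ScalarIterCost - 1) / ScalarIterCost; // ceiling divide
}
```

For example, taking the checks=9,members=18,RTCost=36 case mentioned earlier with an assumed scalar iteration cost of 6 and Denom=2 gives a minimum trip count of 12; the RTCost=146 case needs 49 iterations under the same assumptions.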

My original proposal. Specify the budget in terms of the number of scalar loop iterations. I think then it should be ~12.

Do you mean the cost of 12 scalar loop iterations?

Yep.

In the description you mentioned that ~6 is needed to match the current threshold, so 12 would be very roughly double the threshold IIUC?
With 12, do all expected loops get vectorized for RawSpeed and that's a motivation for 12?

See the spreadsheet linked in the description, RTCost/ScalarCostPerIter is the new threshold, NumChecks is the old "threshold".
With ~6 almost all loops in RawSpeed+darktable get vectorized (well, ignoring those LV doesn't get to/know how to vectorize)
But obviously this threshold is going to vary somewhat per codebase,
as you can see in the spreadsheet, a somewhat higher limit is beneficial for other codebases i looked at.

I assume all loops in RawSpeed have sufficiently large trip counts (like >100 or > 1000)?

It will obviously depend on case-by-case basis, but yes; in the particular unvectorized loops that originally motivated
it's ~10M: (see VC5Decompressor::Wavelet::reconstructPass()::process() and VC5Decompressor::Wavelet::combineLowHighPass()::process())


... for a single input (https://raw.pixls.us/data-unique/GoPro/HERO6%20Black/GOPR9172.GPR)
For darktable loops i'd say the trip count is not smaller than 1M, averaging maybe around ~25M+, or higher depending on lack of unrolling.

I've been wondering if it might be worth to give the users an easier way to tell the compiler it should assume high trip counts. That might make our life easier for projects/files where this applies and could also help with other loop optimizations which only become profitable for larger trip counts (IIRC we had several reports by users where this could be helpful for not only vectorization)

FWIW i'm personally doing this because i'd like these things to just work without any pragmas.
Perhaps PGO counters could be useful for that.

I wasn't really thinking of a pragma (although it might be helpful in some cases), but a new compiler option (like `-fhigh-trip-count-assumption`).
PGO should definitely be helpful here.

Hmm, sounds interesting, but it is still more complex than "just works out of the box" (:

Collected some more numbers (sheet updated) (+ rawtherapee, babl/geg/, vanilla llvm test-suite).
I think it showcases the problem quite well. We currently are okay with vectorizing
if that means emitting a check with checks=8,members=45,RTCost=146,
but not with checks=9,members=18,RTCost=36.

That's an interesting finding! Originally replacing the old threshold with a cost-of-all-checks one did not seem very appealing to me. But if it would be possible to come up with a reasonable translation of the old one (not based on the cases where the cost currently is very much overestimated but some middle-ground), it might be a viable first step. But then it probably would be easier to just transition once and deal with the fallout.

The obvious problem is, whichever new limit we choose, as long as it no longer vectorizes *some* cases,
some of the cases we no longer vectorize were actually profitable to vectorize (i.e. the run-time check passing at runtime).

There's another option that potentially allows us to actually side-step the issue of not knowing the trip count. Based on the formula used in the original version of D109368, we can compute the minimum trip-count required for the vector loop to be profitable. We can also compute a minimum trip count so that the cost of the runtime-check is only a fraction of the total scalar loop cost. We already emit a minimum iteration check which can be adjusted with the additional computed minimums. I think that would allow us to vectorize a lot more aggressively, while still guarding against runtime checks adding a large overhead if they fail for low trip count loops. I updated D109368 accordingly.

Ok, please check if I got this right: instead of having a hard compile-time cut-off for the checks, which we believe is used to guard against compile-time (and file-size) explosion,
we completely drop this limit and always vectorize, but before running the run-time checks, we perform the trip-count check, and if it fails, we fall back to the scalar loop?
That way we incur compile-time cost and file-size bloat, but not run-time cost.
If I got the idea right, that sounds rather too good to be true :) I really like it :)

There's another option that potentially allows us to actually side-step the issue of not knowing the trip count. Based on the formula used in the original version of D109368, we can compute the minimum trip-count required for the vector loop to be profitable. We can also compute a minimum trip count so that the cost of the runtime-check is only a fraction of the total scalar loop cost. We already emit a minimum iteration check which can be adjusted with the additional computed minimums. I think that would allow us to vectorize a lot more aggressively, while still guarding against runtime checks adding a large overhead if they fail for low trip count loops. I updated D109368 accordingly.

Ok, please check if i got this right: instead of having a hard compile-time cut-off for the checks, which we believe is used to guard against compile-time (and file-size) explosion,
we completely drop this limit, always vectorize, but before doing the run-time checks, we perform the trip-count checks, and if it fails, we fallback to scalar loop?

Yes, that should be it basically (we also skip vectorization *if* we already know that the expected trip count is less than the computed minimums; this should also include profile info). This approach was the result of an offline discussion with @Ayal.

That way we incur compile-time cost, filesize bloat, but not run-time cost.

Yep, but I don't think compile-time cost and code-size increases are much to worry about, and they are not the original motivation for the cutoff; the runtime-check limit only prevents vectorization of 1% of otherwise-vectorized loops in SPEC2006/SPEC2017/MultiSource with -O3. And when optimizing for size we currently do not allow runtime checks anyway.

There's another option that potentially allows us to actually side-step the issue of not knowing the trip count. Based on the formula used in the original version of D109368, we can compute the minimum trip-count required for the vector loop to be profitable. We can also compute a minimum trip count so that the cost of the runtime-check is only a fraction of the total scalar loop cost. We already emit a minimum iteration check which can be adjusted with the additional computed minimums. I think that would allow us to vectorize a lot more aggressively, while still guarding against runtime checks adding a large overhead if they fail for low trip count loops. I updated D109368 accordingly.

Ok, please check if i got this right: instead of having a hard compile-time cut-off for the checks, which we believe is used to guard against compile-time (and file-size) explosion,
we completely drop this limit, always vectorize, but before doing the run-time checks, we perform the trip-count checks, and if it fails, we fallback to scalar loop?

Yes, that should be it basically (we also skip vectorization *if* we already know that the expected trip count is less than the computed minimums; this should also include profile info). This approach was the result of an offline discussion with @Ayal.

This sounds truly awesome.

That way we incur compile-time cost, filesize bloat, but not run-time cost.

Yep, but I don't think compile-time cost and code size increases are much to worry about and are not the original motivation for the cutoff; too many runtime checks only prevents vectorization of 1% of otherwise vectorized loops in SPEC2006/SPEC2017/MultiSource with -O3. And when optimizing for size we currently do not allow runtime checks anyways.

Well okay then :)
So I guess what I need to do is rebase this patch on top of D109368, and simply methodically exterminate! exterminate! the compile-time limits instead of redesigning them, correct?

fhahn added a comment. Sat, Sep 25, 1:16 PM

There's another option that potentially allows us to actually side-step the issue of not knowing the trip count. Based on the formula used in the original version of D109368, we can compute the minimum trip-count required for the vector loop to be profitable. We can also compute a minimum trip count so that the cost of the runtime-check is only a fraction of the total scalar loop cost. We already emit a minimum iteration check which can be adjusted with the additional computed minimums. I think that would allow us to vectorize a lot more aggressively, while still guarding against runtime checks adding a large overhead if they fail for low trip count loops. I updated D109368 accordingly.

Ok, please check if i got this right: instead of having a hard compile-time cut-off for the checks, which we believe is used to guard against compile-time (and file-size) explosion,
we completely drop this limit, always vectorize, but before doing the run-time checks, we perform the trip-count checks, and if it fails, we fallback to scalar loop?

Yes, that should be it basically (we also skip vectorization *if* we already know that the expected trip count is less than the computed minimums; this should also include profile info). This approach was the result of an offline discussion with @Ayal.

This sounds truly awesome.

That way we incur compile-time cost, filesize bloat, but not run-time cost.

Yep, but I don't think compile-time cost and code size increases are much to worry about and are not the original motivation for the cutoff; too many runtime checks only prevents vectorization of 1% of otherwise vectorized loops in SPEC2006/SPEC2017/MultiSource with -O3. And when optimizing for size we currently do not allow runtime checks anyways.

Well okay then :)
So I guess what I need to do is rebase this patch on top of D109368, and simply methodically exterminate! exterminate! the compile-time limits instead of redesigning them, correct?

For LV, the threshold should already have been removed, but it doesn't remove RuntimeMemoryCheckThreshold, which is the main difference to this patch AFAICT :) It would be great if you could verify it works as expected with RawSpeed.

There's another option that potentially allows us to actually side-step the issue of not knowing the trip count. Based on the formula used in the original version of D109368, we can compute the minimum trip-count required for the vector loop to be profitable. We can also compute a minimum trip count so that the cost of the runtime-check is only a fraction of the total scalar loop cost. We already emit a minimum iteration check which can be adjusted with the additional computed minimums. I think that would allow us to vectorize a lot more aggressively, while still guarding against runtime checks adding a large overhead if they fail for low trip count loops. I updated D109368 accordingly.

Ok, please check if i got this right: instead of having a hard compile-time cut-off for the checks, which we believe is used to guard against compile-time (and file-size) explosion,
we completely drop this limit, always vectorize, but before doing the run-time checks, we perform the trip-count checks, and if it fails, we fallback to scalar loop?

Yes, that should be it basically (we also skip vectorization *if* we already know that the expected trip count is less than the computed minimums; this should also include profile info). This approach was the result of an offline discussion with @Ayal.

This sounds truly awesome.

That way we incur compile-time cost, filesize bloat, but not run-time cost.

Yep, but I don't think compile-time cost and code size increases are much to worry about and are not the original motivation for the cutoff; too many runtime checks only prevents vectorization of 1% of otherwise vectorized loops in SPEC2006/SPEC2017/MultiSource with -O3. And when optimizing for size we currently do not allow runtime checks anyways.

Well okay then :)
So I guess what I need to do is rebase this patch on top of D109368, and simply methodically exterminate! exterminate! the compile-time limits instead of redesigning them, correct?

For LV, the threshold should have been already be removed, but it doesn't move RuntimeMemoryCheckThreshold, which is a main difference to this patch AFAICT :) It would be great if you could verify it works as expected with rawspeed.

I've verified, and I can happily confirm that D109368 by itself vectorizes the problematic loops in question,
rendering this patch effectively obsolete. I will abandon it once D109368 lands.
@fhahn thank you!