This is an archive of the discontinued LLVM Phabricator instance.

[LV] Vectorize cases with larger number of RT checks, execute only if profitable.
ClosedPublic

Authored by fhahn on Sep 7 2021, 8:35 AM.

Details

Summary

This patch replaces the tight hard cut-off for the number of runtime
checks with a more accurate cost-driven approach.

The new approach allows vectorization with a larger number of runtime
checks in general, but only executes the vector loop (and runtime checks) if
considered profitable at runtime. Profitable here means that the cost-model
indicates that the runtime check cost + vector loop cost < scalar loop cost.

To do that, LV computes the minimum trip count for which runtime check cost
+ vector-loop-cost < scalar loop cost.
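
For illustration (with made-up costs, not numbers from the patch): assuming a scalar cost of 8 per iteration, a vector cost of 20 per vector iteration at VF = 4 (i.e. 5 per scalar iteration) and a one-off runtime-check cost of 30, the break-even point is 30 / (8 - 5) = 10 iterations, so the runtime checks and vector loop would only be entered for trip counts of at least 10.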

Note that there is still a hard cut-off to avoid excessive compile-time/code-size
increases, but it is much larger than the original limit.

The performance impact on standard test-suites like SPEC2006/SPEC2017/MultiSource
is mostly neutral, but the new approach can give substantial gains in cases where
we failed to vectorize before due to the over-aggressive cut-offs.

On AArch64 with -O3, I didn't observe any regressions outside the noise level (<0.4%)
and there are the following execution time improvements. Both IRSmk and srad are relatively short running, but the changes are far above the noise level for them on my benchmark system.

CFP2006/447.dealII/447.dealII    -1.9%
CINT2017rate/525.x264_r/525.x264_r    -2.2%
ASC_Sequoia/IRSmk/IRSmk       -9.2%
Rodinia/srad/srad     -36.1%

Size regressions on AArch64 with -O3 are:

MultiSource/Applications/hbd/hbd                 90256.00   106768.00 18.3%
MultiSourc...ks/ASCI_Purple/SMG2000/smg2000     240676.00   257268.00  6.9%
MultiSourc...enchmarks/mafft/pairlocalalign     472603.00   489131.00  3.5%
External/S...2017rate/525.x264_r/525.x264_r     613831.00   630343.00  2.7%
External/S...NT2006/464.h264ref/464.h264ref     818920.00   835448.00  2.0%
External/S...te/538.imagick_r/538.imagick_r    1994730.00  2027754.00  1.7%
MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4    1236471.00  1253015.00  1.3%
MultiSource/Applications/oggenc/oggenc         2108147.00  2124675.00  0.8%
External/S.../CFP2006/447.dealII/447.dealII    4742999.00  4759559.00  0.3%
External/S...rate/510.parest_r/510.parest_r   14206377.00 14239433.00  0.2%

Diff Detail

Event Timeline

fhahn created this revision.Sep 7 2021, 8:35 AM
fhahn requested review of this revision.Sep 7 2021, 8:35 AM
Herald added a project: Restricted Project. Sep 7 2021, 8:35 AM

Hi Florian,

I do think this change is a very important move in the right direction. Thanks for driving it. My main concern is essentially the same as for D75981. I think such things should be done directly in the CostModel. I've uploaded an alternative implementation which does the same thing as this patch but in the CostModel. Please take a look and let me know if you like that or not. It would be really helpful to hear from others as well.

(to make it more obvious, the alternative patch is https://reviews.llvm.org/D109444)

I realized that doing a similar thing in the cost model requires a bit of preparatory work (about 6 patches). Due to that, I think it may be reasonable to land this first. Please find my comments inlined. One thing I would like to ask you: in order to simplify merging with the mentioned 6 patches, it would be nice if you rebased your work on D109443 (hopefully it can be landed quickly) and took D109444 (except line 10182) as part of this change.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7508

Would it be better to use !SelectedVF.Width.isScalar() instead? That would make it obvious that RTCost should be applied to vector loops only.

7509

In general, I would like us to agree on the strategy to follow when the number of iterations is not known at compile time.
Currently, I see the following inconsistency in the use of getSmallBestKnownTC. On the one hand, it may return an upper bound estimate (SE.getSmallConstantMaxTripCount) which may greatly overestimate the real trip count. On the other hand, it returns None if SE.getSmallConstantMaxTripCount didn't manage to get an estimate, and we will end up skipping the check entirely.
I see 3 possible ways to follow in the case of an unknown trip count:

  1. Don't check RT overhead if the trip count is unknown at compile time. This is the most conservative solution.
  2. Assume the maximum possible number of iterations. In practice that will essentially give the same result as 1) (because RTCost / double(*ExpectedTC) would be less than 1). This is essentially what this patch does. In addition we would need to take the max value for None as well.
  3. Assume some fixed, reasonable average number of iterations (may be extended to a more complex heuristic in the future).

While in theory 3) is more general and may give a better estimate, it may require painful tuning. Given that, I would vote for 1) at this moment as the most conservative.

Note also that regardless of the chosen way to go, we should decide how to categorize a trip count deduced from a profile. I personally think we should treat it the same way as a known trip count.

7512

In the general case, the total vector cost is "PrologCost + VectorCost*(ExpectedTC/Width) + EpilogCost", where (ExpectedTC/Width) is an integer division. RTCost is just a part of the prolog cost; I think that is worth mentioning. Then we should explain why/how it reduces to "RTCost + VectorCost*(ExpectedTC/Width)". Thus we end up with the check "ScalarCost * ExpectedTC <= RTCost + VectorCost*(ExpectedTC/Width)". Division by ExpectedTC (using FP) makes sense to me and should give acceptable accuracy. Multiplication by 'Width' may give up to a 'Width - 1' error. For that reason I would avoid doing the multiplication by 'Width'. Thus we end up with "ScalarCost <= RTCost/(double)ExpectedTC + VectorCost/Width".

There are still two cases not taken into account (foldTailByMasking() and requiresScalarEpilogue()), but I think it's OK for this type of check.

Does that make sense?
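
For reference, here is a minimal sketch of the check described above, with illustrative names and all costs evaluated as doubles (this is not the patch's code):

#include <cstdint>

// Vectorization (including the runtime checks) is considered unprofitable when
// ScalarCost <= RTCost / ExpectedTC + VectorCost / Width.
static bool runtimeChecksProfitable(double ScalarCost, double VectorCost,
                                    double RTCost, unsigned Width,
                                    uint64_t ExpectedTC) {
  return ScalarCost > RTCost / double(ExpectedTC) + VectorCost / double(Width);
}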

7521

I would suggest keeping the result in a variable to avoid copying the (possibly non-trivial) formula.

fhahn updated this revision to Diff 375060.Sep 25 2021, 11:54 AM

Hi Florian,

I do think this change is a very important move in the right direction. Thanks for driving it. My main concern is essentially the same as for D75981. I think such things should be done directly in the CostModel. I've uploaded an alternative implementation which does the same thing as this patch but in the CostModel. Please take a look and let me know if you like that or not. It would be really helpful to hear from others as well.

Thanks for taking a look!

I made a largish adjustment to the patch. It still uses the formula outlined in the original patch, but uses it differently: instead of computing the costs for known/expected trip counts, it now computes the minimum trip count given the costs. This new minimum trip count can then be checked against the known/expected trip count. If there is no known/expected trip count, this minimum trip count is used for the minimum iteration check.

It also computes a second minimum to put a bound on the runtime overhead (if the checks fail) compared to the scalar loop. This should guard against failing runtime checks increasing the total runtime by more than a fraction of the scalar loop. (At the moment the fraction is hardcoded to 1/10th of the total scalar loop cost, but I'll add an option for it if we converge.)
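
As a rough sketch of the two minimums described above (illustrative names, the hardcoded 1/10 fraction, and assuming ScalarC > VecC / VF; this is not the patch's actual code):

#include <algorithm>
#include <cmath>
#include <cstdint>

// ScalarC: cost per scalar iteration, VecC: cost per vector iteration,
// RtC: one-off cost of the runtime checks, VF: the (fixed) vectorization factor.
uint64_t minProfitableTripCount(double ScalarC, double VecC, double RtC, unsigned VF) {
  // MinTC1: smallest TC for which RtC + (VecC / VF) * TC < ScalarC * TC.
  double MinTC1 = RtC / (ScalarC - VecC / VF);
  // MinTC2: keep the cost of failing checks below 1/10th of the scalar loop cost,
  // i.e. RtC <= ScalarC * TC / 10.
  double MinTC2 = 10.0 * RtC / ScalarC;
  uint64_t MinTC = uint64_t(std::ceil(std::max(MinTC1, MinTC2)));
  return (MinTC + VF - 1) / VF * VF; // round up to a multiple of VF
}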

I realized that doing a similar thing in the cost model requires a bit of preparatory work (about 6 patches). Due to that, I think it may be reasonable to land this first. Please find my comments inlined. One thing I would like to ask you: in order to simplify merging with the mentioned 6 patches, it would be nice if you rebased your work on D109443 (hopefully it can be landed quickly) and took D109444 (except line 10182) as part of this change.

Given the large update, I have not rebased it on D109443 yet. I'd suggest discussing moving the code into the cost model separately in your patches. I think at the moment the cost model is mostly concerned with picking the best vectorization factor and focuses on computing costs for different VFs. I am not sure yet what the trade-offs of adding more VF-independent cost modeling there are (drawbacks are adding even more global state to the cost model and increasing complexity). I assume the motivation for moving it to the cost model is to allow using it when deciding whether to vectorize given ScalarEpilogueStatus? I think that's best discussed separately, with a focus on that case.

Matt added a subscriber: Matt.Sep 25 2021, 12:34 PM

FYI this patch can't be applied with arc easily:
error: llvm/test/Transforms/LoopVectorize/X86/pointer-runtime-checks-unprofitable.ll: does not exist in index

lebedev.ri accepted this revision.Sep 25 2021, 3:32 PM

I have checked, and this is an even more intrusive solution than the originally-discussed design (D109296),
and it very successfully vectorizes the loops in question!

I must say, I really like this approach, since it is the only way to truly cover all the cases.
I basically had in mind something along the lines of "do we even need to restrict the run-time checks",
but I did not propose this as I thought it would be *too* radical.

I've gone over the math (both by hand and via sage), and it checks out.
As far as I'm concerned, this Looks Great.

@fhahn @Ayal thank you so much!

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
193

Nit: what does prof mean? PGO profile? Profitable?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2943–2948

This looks like std::max(step, MinProfTripCount)?
Might be worth adding a comment.

7528
7530–7531

I'm rather sure we can't :)

7533–7538

I agree that here the math checks out, the sign is correct.

7536

This is not VecC, it's VecC/VF.
Not an error, but confusing.

7548–7552

So does this round up or down?
Use alignTo()/alignDown()?

This revision is now accepted and ready to land.Sep 25 2021, 3:32 PM

Also, this patch should probably steal its subject from D109296; the current one does not do it justice.

ebrevnov added inline comments.Sep 26 2021, 11:19 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
447

Since 'minimum profitable trip count' is part of VectorizationFactor and InnerLoopVectorizer should know both VecWidth and MinProfTripCount, I would suggest passing that information as a single VectorizationFactor argument.

7531

The expression 'VecC imul (TC idiv VF)' is not associative since it involves an integer division (idiv). Thus it's not legal to simply replace it with 'TC fmul (VecC fdiv VF)'. The maximum possible error is up to 'VF'. In other words, 'TC fmul (VecC fdiv VF)' - 'VecC imul (TC idiv VF)' <= VF. That means 'MinTC1' computed that way is an upper estimate of the actual minimum. I don't see anything terribly bad in taking an upper estimate, but IMHO it is worth mentioning in the comments.

7554

Please use alignTo() instead.
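
For what it's worth, a minimal illustration of the suggestion (names are hypothetical, assuming a fixed-width VF):

#include "llvm/Support/MathExtras.h"
#include <cstdint>

// Round the raw trip-count estimate up to the next multiple of the vectorization factor.
uint64_t roundUpToVF(uint64_t RawMinTC, uint64_t VFKnownMin) {
  return llvm::alignTo(RawMinTC, VFKnownMin);
}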

This sounds like a sensible general approach to take to me, from a cost point of view. It seems similar to how I once heard the GCC vectorizer's cost calculation described.

But there may be more work required on the costing calculations. It might be both under- and over-estimating the costs in places.
The first obvious problem I ran into was that the runtime checks it emits are not simplified prior to the costs being added for each instruction. The overflow checks seem to create "umul_with_overflow(|step|, TC)". But the step is commonly 1, so given half a chance the instruction would be simplified away.
It looks like there were other inefficiencies in the values created too, with 'select false, x, y' or 'or x, false' being added to the cost. But the umul_with_overflow(1, TC) was getting a pretty high cost.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
447

Can this have a default value, to prevent the need for multiple constructors?

7517

This should be RtC + VecC * floor(TC / VF) + EpiC or RtC + VecC * ceil(TC / VF) when folding the tail (assuming there is no epilogue then). That makes the math more difficult unless it makes the assumption that those round to the same thing.

It is also assuming that the runtime checks do not fail, otherwise a probability factor would need to be added. (Probably fine to assume, but the more runtime checks there are, the less likely they are to all succeed).

And there can be other costs for vectorizations. The "check N is less than MinProfTripCount" isn't free and there can be other inefficiencies from vector loops.

7533

Perhaps add some more assumptions, like the ones mentioned above and that (ScalarC - (VecC / VF)) > 0 from other profitability checks.

Given the large update, I have not rebased it on D109443 yet. I'd suggest discussing moving the code into the cost model separately in your patches.

Sure, let's discuss the move to the cost model separately. Having said that, the first four patches in the series (including D109443) are general improvements independent of the change to the cost model itself.
Of course, there is no strong dependency and D109443 should not block this one.

ebrevnov added inline comments.Sep 27 2021, 12:33 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7517

This should be RtC + VecC * floor(TC / VF) + EpiC or RtC + VecC * ceil(TC / VF) when folding the tail (assuming there is no epilogue then).

Since we already compute an upper estimate for MinTC1, it doesn't seem necessary to do an additional adjustment when folding the tail. But we probably want/need to do an adjustment for the "requiresScalarEpilogue" case.

7533

It's totally fine for '(ScalarC - (VecC / VF))' to be negative since we take max(MinTC1, MinTC2) and MinTC2>=0

ebrevnov accepted this revision.Sep 27 2021, 12:38 AM

Generally LGTM (except for a few mentioned nits)

nikic added a subscriber: nikic.Sep 27 2021, 12:59 AM

Can you please provide an analysis of the code size impact this has? LLVM is already quite bad at unrolling/vectorizing too aggressively for non-benchmark code (which does not have hot loops) and I'm somewhat concerned this will make the situation even worse. Please do keep in mind that code size has important performance effects on icache and itlb misses, and just jumping over the code doesn't make that cost disappear.

I am not sure what the trade-offs are to adding more VF independent cost-modeling there yet (drawbacks are adding even more global state to the cost model, increasing complexity).

Right, while the cost of runtime checks is (at the moment) VF independent, we can decide not to do it inside the cost model. Though conceptually it looks like a much more solid design when the CostModel is responsible for the cost-based decisions. And the total code complexity will be even less than doing cost modeling in different places.

I assume the motivation for moving it to the cost model is to allow using it when deciding on whether to vectorize given ScalarEpilogueStatus?

Right, the cost of the epilog depends on TC & VF. No excuse to do it outside the cost model :-)

I think that's best discussed separate with focus on that case.

Sure, we can do it later. But we will have to do some restructuring first (essentially move the existing code inside the CM), otherwise there will be nasty code duplication.

fhahn updated this revision to Diff 375681.Sep 28 2021, 1:21 PM
fhahn marked 17 inline comments as done.

Updated to address the inline comments. Thanks everyone! I hope I didn't miss any (apologies if I did).

Also, this patch should probably steal its subject from D109296; the current one does not do it justice.

Yep, I need to update the title & description! I'll do that once I have collected a bit more supporting data.

This sounds like a sensible general approach to take to me, from a cost point of view. It seems similar to how I once heard the GCC vectorizer's cost calculation described.

But there may be more work required on the costing calculations. It might be both under- and over-estimating the costs in places.
The first obvious problem I ran into was that the runtime checks it emits are not simplified prior to the costs being added for each instruction. The overflow checks seem to create "umul_with_overflow(|step|, TC)". But the step is commonly 1, so given half a chance the instruction would be simplified away.
It looks like there were other inefficiencies in the values created too, with 'select false, x, y' or 'or x, false' being added to the cost. But the umul_with_overflow(1, TC) was getting a pretty high cost.

Thanks for raising this point! As you mentioned, we should try to make sure the code we emit for the runtime checks is as close to the final version as possible. The 'or x, false' case should be easy to fix. For the umul_with_overflow I'd need to take a closer look to see where the best place to improve that would be.

Can you please provide an analysis of the code size impact this has? LLVM is already quite bad at unrolling/vectorizing too aggressively for non-benchmark code (which does not have hot loops) and I'm somewhat concerned this will make the situation even worse. Please do keep in mind that code size has important performance effects on icache and itlb misses, and just jumping over the code doesn't make that cost disappear.

I am still collecting the relevant data and I'll share it here once I have more details. On a high level, the impact on the number of vectorized loops is relatively small on a large set of benchmarks (SPEC2006/SPEC2017/MultiSource), with a ~1% increase in vectorized loops on ARM64 with -O3.

So far the top size increases seem to be in smaller benchmarks. I need to take a closer look at IRSmk and smg2000 in particular. A first inspection of IRSmk shows that it mostly consists of 2-3 large loops for which we did not generate runtime checks before, but do with this patch. I still need to check why the increase is so big.

test-suite...s/ASC_Sequoia/IRSmk/IRSmk.test   3240.00    26668.00   723.1%
test-suite...CI_Purple/SMG2000/smg2000.test   154724.00  403628.00  160.9%
test-suite...chmarks/Rodinia/srad/srad.test   4332.00    5736.00    32.4%
test-suite...Source/Benchmarks/sim/sim.test   13628.00   16660.00   22.2%
test-suite...pplications/oggenc/oggenc.test   202936.00  232608.00  14.6%
test-suite...oxyApps-C/miniGMG/miniGMG.test   51588.00   57792.00   12.0%
test-suite...arks/mafft/pairlocalalign.test   356532.00  397024.00  11.4%
test-suite...pps-C/SimpleMOC/SimpleMOC.test   30488.00   33436.00    9.7%
test-suite...oxyApps-C/miniAMR/miniAMR.test   51744.00   55580.00    7.4%
test-suite...rks/FreeBench/pifft/pifft.test   57368.00   59032.00    2.9%
test-suite...8.imagick_r/538.imagick_r.test   1484356.00 1518020.00  2.3%
test-suite...yApps-C++/PENNANT/PENNANT.test   105064.00  107004.00   1.8%
test-suite...rks/tramp3d-v4/tramp3d-v4.test   776340.00  788504.00   1.6%
test-suite...6/464.h264ref/464.h264ref.test   614848.00  623200.00   1.4%
llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
193

Changed to MinProfitableTripCount and added a comment.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
447

Can this have a default value, to prevent the need for multiple constructors?

unfortunately that's not possible with the currently available constructors. But I made it a required argument and updated the callers to avoid the extra constructor.

Since 'minimum profitable trip count' is part of VectorizationFactor and InnerLoopVectorizer should know both VecWidth and MinProfTripCount, I would suggest passing that information as a single VectorizationFactor argument.

That would be good, but unfortunately I think the epilogue vectorizer instantiation only has access to an ElementCount for now :( This can be threaded through as a follow-up.

7517

This should be RtC + VecC * floor(TC / VF) + EpiC or RtC + VecC * ceil(TC / VF) when folding the tail (assuming there is no epilogue then). That makes the math more difficult unless it makes the assumption that those round to the same thing.

I added a statement below about the fact that the computations are performed on doubles and later rounded up, giving an upper bound estimate as @ebrevnov suggested. Do you think that's sufficient?

It is also assuming that the runtime checks do not fail, otherwise a probability factor would need to be added. (Probably fine to assume, but the more runtime checks there are, the less likely they are to all succeed).

Yep, that's a fundamental assumption at the moment. Unfortunately I cannot think of a good way to estimate the probability of the checks passing. If we assigned a fixed probability per runtime check, we would likely end up with a hard limit like we have at the moment, just expressed differently.

The main motivation of MinTC2 below is to introduce a limit on the impact of a (large) number of runtime checks. The main goal is to prevent evaluating large runtime checks for short-running loops (at the moment we allow increasing the total runtime by 10% due to failing runtime checks, but this could also be lower).

While there might be additional cases where failing runtime checks cause an increase in runtime, the same problem already exists even with the hard-coded limit we have at the moment.

We could also change the way we emit runtime checks slightly and break them up across multiple blocks with earlier exits, to increase the chances that we do not have to evaluate all runtime checks if some fail.

And there can be other costs for vectorizations. The "check N is less than MinProfTripCount" isn't free and there can be other inefficiencies from vector loops.

Agreed, this can be an unfortunate side effect. Again, this is a problem we are already hitting, and this patch will add a few more vectorized loops. But I think in general the impact on the number of loops vectorized with this patch should be relatively small (for SPEC2006/SPEC2017/MultiSource ~1% more loops are vectorized). And I think unfortunately there's not much we can do to avoid this check in general.

One follow-up that I think becomes important is to make sure that we try to use PGO to detect cases where we create dead vector loops and skip vectorizing them.

7531

Thanks, I tried to update the comment to make this clearer.

7533

I think it should not really matter as @ebrevnov said, but it being negative may yield interesting cases to check the cost-modeling.

7536

Changed to VecCOverVF.

7548–7552

Ah that was the one I was looking for! updated thanks!

7554

updated, thanks!

lebedev.ri accepted this revision.Sep 28 2021, 1:41 PM

Thank you, still looks great!

dmgreen added inline comments.Sep 29 2021, 1:49 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
447

unfortunately that's not possible with the currently available constructors. But I made it a required argument and updated the callers to avoid the extra constructor.

I was expecting it to use = ElementCount(), but this sounds OK too. Is the InnerLoopUnroller value deliberately 1, or would passing zero be better? I imagine it doesn't make much difference in practice.

7517

As far as I understand (correct me if I'm wrong!) we are essentially changing from code that looked like:

if (N < VF)
  goto scalar loop
if (!runtimechecks)
  goto scalar loop
vector loop; n -= VF
scalar loop

To the same code, but with a different initial guard value and potentially more runtime checks in places:

if (N < MinProfitableTripCount)
  goto scalar loop
if (!runtimechecks)
  goto scalar loop
vector loop; n -= VF
scalar loop

That means that if we under-estimate MinProfitableTripCount we go into the runtime checks/vector loop, potentially executing a lot of expensive runtime checks where it is not profitable.
If we _over_ estimate the MinProfitableTripCount then at runtime we will not execute the vector code, falling back to the scalar loop. So we have generated larger/less efficient scalar code that then never executes the vector part, even if it would be profitable to do so.

So we end up in the unfortunate place where either over or under estimating the cost can lead to inefficiencies.

I'm not too worried about the details here. They sound fine for the most part, so long as they are close enough. I'm more worried about the cost of the runtime checks being over-estimated due to them being unsimplified prior to costing. I think that is where the worst regressions I am seeing from this patch come from. Loops where vector code was previously generated and executed are now skipped over. Unfortunately, loops with lowish trip counts are common in a lot of code :)

The code in LoopVectorizationCostModel::isMoreProfitable already talks about the cost in terms of PerIterationCost*ceil(TripCount/VF) vs PerIterationCost*floor(TC/VF) though, and I would recommend describing things in the same way here, explaining that RtC + VecC * (TC / VF) + EpiC is a simplification of that.

That would be good, but unfortunately I think the epilogue vectorizer instantiation only has access to an ElementCount for now :( This can be threaded through as a follow-up.

There is no cost associated with runtime checks for epilogue vectorization. Thus we should simply initialize 'MinProfitableTripCount' with an 'unset' value in the Epilogue Vectorizer.

ebrevnov added inline comments.Oct 1 2021, 12:04 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
458

IMHO we'd better keep the semantics of 'MinProfitableTripCount' and not change its value to something not coming from profitability considerations. We can easily do the required adjustments at the use site(s).

fhahn updated this revision to Diff 380490.Oct 18 2021, 11:49 AM

Just a rebase after avoiding emitting the dummy AND x, true as part of the runtime checks in e844f05397b7.

The changes to llvm/test/Transforms/LoopVectorize/ARM/mve-qabs.ll are gone now.

Reverse ping - thanks!
I've mostly implemented interleaved load/store cost modelling for AVX2 (related D111460 is left)
since the original evaluation of this patch, so the effect this has may be different now.

Can we make it so that this code doesn't produce a umul_with_overflow if the step is 1?
https://github.com/llvm/llvm-project/blob/8e4c806ed5a481e4d2163c8330f3c3c024d61a36/llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp#L2501

(We may need to improve the costmodel for it under AArch64 too, but I've not looked into that quite yet. I'm not sure if @fhahn is planning anything like that either?)
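
A rough sketch of one way such a special case could look inside the expander (names like AbsStep, TCTrunc and B are placeholders, not the actual SCEVExpander code):

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"
#include <utility>
using namespace llvm;

// Compute |Step| * TC plus an overflow flag, but skip the umul_with_overflow
// when |Step| is the constant 1, since 1 * TC can never overflow.
static std::pair<Value *, Value *> mulStepByTC(IRBuilder<> &B, Value *AbsStep,
                                               Value *TCTrunc) {
  auto *StepC = dyn_cast<ConstantInt>(AbsStep);
  if (StepC && StepC->isOne())
    return {TCTrunc, ConstantInt::getFalse(B.getContext())};
  CallInst *Mul = B.CreateIntrinsic(Intrinsic::umul_with_overflow,
                                    {TCTrunc->getType()}, {AbsStep, TCTrunc});
  return {B.CreateExtractValue(Mul, 0, "mul.result"),
          B.CreateExtractValue(Mul, 1, "mul.overflow")};
}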

Can you point me at the test where that happens?

Hmm I don't know if there is a test. This should hopefully show it: https://godbolt.org/z/6Th4o1s5K

If you print the costs for the runtime checks, you can see they are unsimplified, with the umul being the largest part of the cost:

Cost of 0 for RTCheck   %4 = trunc i64 %0 to i32                                              
Cost of 10 for RTCheck   %mul31 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 1, i32 %4)
Cost of 0 for RTCheck   %mul.result = extractvalue { i32, i1 } %mul31, 0                      
Cost of 0 for RTCheck   %mul.overflow = extractvalue { i32, i1 } %mul31, 1                    
Cost of 1 for RTCheck   %5 = add i32 %2, %mul.result                                          
Cost of 1 for RTCheck   %6 = sub i32 %2, %mul.result                                          
Cost of 1 for RTCheck   %7 = icmp ugt i32 %6, %2                                              
Cost of 1 for RTCheck   %8 = icmp ult i32 %5, %2                                              
Cost of 1 for RTCheck   %9 = select i1 false, i1 %7, i1 %8                                    
Cost of 1 for RTCheck   %10 = icmp ugt i64 %0, 4294967295                                     
Cost of 1 for RTCheck   %11 = or i1 %9, %10                                                   
Cost of 1 for RTCheck   %12 = or i1 %11, %mul.overflow                                        
Cost of 1 for RTCheck   %13 = or i1 false, %12                                                
LV: Minimum required TC for runtime checks to be profitable:28

I'm not sure if they should be simplified by the builder during construction, simplified prior to costing, or whether the code to create them needs to be more precise.

So good and bad news. While the @llvm.umul.with.overflow case
was straight-forward (done in 156f10c840a0), there is still a significant number
of inefficiencies in the IR for these checks. I wasn't particularly looking forward to
arriving at the answer, but it is pretty obvious: if we really want to minimize
the estimated cost for these checks, we have to run instsimplify (or even instcombine)
on them first. The caveat here is that we first need to defuse SCEVExpanderCleaner,
because simplification will lead to dead instructions, and leaving them will again
lead to artificial cost. I feel like that is an improvement that is best done after this change itself, even though I'm not quite sure yet how to approach it.
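
For illustration, folding the trivially simplifiable check instructions before costing could look roughly like this (hypothetical names; the exact simplification entry point and the SCEVExpanderCleaner bookkeeping are exactly the open questions mentioned above):

#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/InstructionSimplify.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/DataLayout.h"
using namespace llvm;

// Replace instructions in the generated check block that fold away (e.g.
// 'or x, false' or 'umul_with_overflow(1, TC)') so they do not inflate the
// cost estimate. Operands that become dead still need separate clean-up.
static void foldTrivialCheckInsts(BasicBlock *CheckBlock, const DataLayout &DL) {
  SmallVector<Instruction *, 16> Worklist;
  for (Instruction &I : *CheckBlock)
    Worklist.push_back(&I);
  for (Instruction *I : Worklist)
    if (Value *V = simplifyInstruction(I, SimplifyQuery(DL))) {
      I->replaceAllUsesWith(V);
      if (I->use_empty())
        I->eraseFromParent();
    }
}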

haowei added a subscriber: haowei.Oct 27 2021, 3:36 PM

We are seeing these changes (ab1dbcecd6f0a to f3190dedeef9) break multiple Polly::CodeGen tests.
E.g. Polly :: CodeGen/aliasing_different_pointer_types.ll failed with the message:

Script:
--
: 'RUN: at line 1';   /b/s/w/ir/x/w/staging/llvm_build/bin/opt  -polly-process-unprofitable  -polly-remarks-minimal  -polly-use-llvm-names  -polly-import-jscop-dir=/b/s/w/ir/x/w/llvm-project/polly/test/CodeGen  -polly-codegen-verify  -polly-codegen -S < /b/s/w/ir/x/w/llvm-project/polly/test/CodeGen/aliasing_different_pointer_types.ll | /b/s/w/ir/x/w/staging/llvm_build/bin/FileCheck /b/s/w/ir/x/w/llvm-project/polly/test/CodeGen/aliasing_different_pointer_types.ll
--
Exit Code: 1

Command Output (stderr):
--
/b/s/w/ir/x/w/llvm-project/polly/test/CodeGen/aliasing_different_pointer_types.ll:18:10: error: CHECK: expected string not found in input
; CHECK: %[[orAndTrue:[._a-zA-Z0-9]]] = and i1 true, %[[le1OrLe2]]
         ^
<stdin>:20:19: note: scanning from here
 %6 = or i1 %2, %5
                  ^
<stdin>:20:19: note: with "le1OrLe2" equal to "6"
 %6 = or i1 %2, %5
                  ^
<stdin>:40:2: note: possible intended match here
 %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
 ^

Input file: <stdin>
Check file: /b/s/w/ir/x/w/llvm-project/polly/test/CodeGen/aliasing_different_pointer_types.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
            .
            .
            .
           15:  %polly.access.A1 = getelementptr double*, double** %A, i64 1024 
           16:  %polly.access.B2 = getelementptr float*, float** %B, i64 0 
           17:  %3 = ptrtoint double** %polly.access.A1 to i64 
           18:  %4 = ptrtoint float** %polly.access.B2 to i64 
           19:  %5 = icmp ule i64 %3, %4 
           20:  %6 = or i1 %2, %5 
check:18'0                       X error: no match found
check:18'1                         with "le1OrLe2" equal to "6"
           21:  br i1 %6, label %polly.start, label %for.cond.pre_entry_bb 
check:18'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           22:  
check:18'0     ~
           23: for.cond.pre_entry_bb: ; preds = %polly.split_new_and_old 
check:18'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           24:  br label %for.cond 
check:18'0     ~~~~~~~~~~~~~~~~~~~~
           25:  
check:18'0     ~
            .
            .
            .
           35:  %arrayidx2 = getelementptr inbounds double*, double** %A, i64 %indvars.iv 
check:18'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           36:  store double* %tmp1, double** %arrayidx2, align 8 
check:18'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           37:  br label %for.inc 
check:18'0     ~~~~~~~~~~~~~~~~~~~
           38:  
check:18'0     ~
           39: for.inc: ; preds = %for.body 
check:18'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           40:  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 
check:18'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:18'2      ?                                                  possible intended match
           41:  br label %for.cond 
check:18'0     ~~~~~~~~~~~~~~~~~~~~
           42:  
check:18'0     ~
           43: polly.merge_new_and_old: ; preds = %polly.exiting, %for.cond 
check:18'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           44:  br label %for.end 
check:18'0     ~~~~~~~~~~~~~~~~~~~
           45:  
check:18'0     ~
            .
            .
            .
>>>>>>
--

failed build: https://ci.chromium.org/ui/p/fuchsia/builders/toolchain.ci/clang-linux-x64/b8832202202604530161/overview

Could you take a look? If it takes a long time to fix, could you revert your change first?

We are seeing these changes (ab1dbcecd6f0a to f3190dedeef9) break multiple Polly::CodeGen tests.

Fixed, thank you.

Reverse ping - thanks!
I've mostly implemented interleaved load/store cost modelling for AVX2 (related D111460 is left)
since the original evaluation of this patch, so the effect this has may be different now.

Can we make it so that this code doesn't produce a umul_with_overflow if the step is 1?
https://github.com/llvm/llvm-project/blob/8e4c806ed5a481e4d2163c8330f3c3c024d61a36/llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp#L2501

(We may need to improve the costmodel for it under AArch64 too, but I've not looked into that quite yet. I'm not sure if @fhahn is planning anything like that either?)

I just rebased the patches after some of the recent changes, but there are a ton of crashes now when cleaning up after the SCEVExpander. I need to investigate that first, before collecting a new set of numbers.

fhahn updated this revision to Diff 388715.Nov 20 2021, 10:23 AM

The latest rebase should resolve the crashes!

I also collected a first set of numbers for AArch64 with -O3 for Geekbench and SPEC2017. So far it looks like there is only one notable change: 544.nab_r regressed by 18%. So overall not too bad, but this one needs investigating. If anybody would be able to collect numbers for other platforms/benchmarks, that would also be extremely helpful.

fhahn updated this revision to Diff 397084.Jan 3 2022, 9:26 AM

Rebased, removed some code that became dead and also introduced a new cut-off on the number of runtime checks. This cutoff is there to only control compile-time. This fixes some excessive compile-time regressions, especially in mafft (was +15%).

While the cutoff is not ideal, I think we have to accept a bound on the number of runtime checks, because in degenerate cases vectorization can add a lot of additional code. For one case in mafft, vectorizing a loop with 325 runtime checks caused a 200% compile-time regression for that file. Note that the chosen cutoff is already higher than the previous pragma threshold.

Compile-time impact with this patch : http://llvm-compile-time-tracker.com/compare.php?from=05ce750b7968c548cf10a5f1413cf5aac3f1b083&to=f2093a608e33dc48dc2b8ebbbb1d0cd45b9bf6e7&stat=instructions

NewPM-O3: +0.25%
NewPM-ReleaseThinLTO: +0.18%
NewPM-ReleaseLTO-g: +0.18%

fhahn added a comment.Jan 5 2022, 1:31 PM

I think I tracked down the remaining regression to the SCEVExpander creating very bad code when expanding SCEV predicates for certain cases (like @dmgreen mentioned earlier). I'll put up a few small patches for SCEVExpander to improve things starting with D116696. After that, I think the patch should be good to go both in terms of compile-time and runtime perf.

Good to go now?

Herald added a project: Restricted Project. Apr 5 2022, 7:56 AM
fhahn updated this revision to Diff 420560.Apr 5 2022, 9:45 AM

Rebased

Good to go now?

Basically yes from my perspective, although we should wait for D122126 and maybe D119078 to land. We should definitely make sure that there's a substantial gap between this patch and D119078, because both have the potential to shake things up quite a bit.

fhahn updated this revision to Diff 432026.May 25 2022, 9:30 AM

Rebased on top of the other changes in this area (D122126, D119078). Those patches landed a while ago, so I think now would be a good time to move forward with this. Ping :)

dmgreen added inline comments.May 26 2022, 2:19 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1973

Would it be better to call CM.getInstructionCost here, or the base TTI.getInstructionCost?
It appears that the cost of a vscale is coming through as 10, where it should be 1. I'm not sure if any of the other processing done in CM.getInstructionCost is very useful for the scalar runtime checks?
I've added some quick tests (rG75631438e333) to show that the cost should be 1; it is just not treated as a "vectorizable intrinsic" by CM, so it is given the generic cost of a call.

1980

On a related note, can we get this to print the costs of each of the instructions in the runtime checks? It is useful for debugging when the numbers are incorrect. Otherwise at the moment I believe it just prints the final MinProfitableTripCount without any explanation of how it got there.
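
For illustration, a per-instruction cost dump over a generated check block could look roughly like this (a sketch with assumed names; the exact TTI hook, cost kind and output format are assumptions):

#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/Support/Debug.h"
#define DEBUG_TYPE "loop-vectorize"
using namespace llvm;

// Print the estimated cost of every instruction in a runtime-check block and
// return the accumulated total, so the final MinProfitableTripCount can be
// explained from the -debug output.
static InstructionCost dumpRTCheckCosts(BasicBlock *CheckBlock,
                                        const TargetTransformInfo &TTI) {
  InstructionCost Total = 0;
  for (Instruction &I : *CheckBlock) {
    InstructionCost C =
        TTI.getInstructionCost(&I, TargetTransformInfo::TCK_RecipThroughput);
    LLVM_DEBUG(dbgs() << "Cost of " << C << " for RTCheck " << I << "\n");
    Total += C;
  }
  return Total;
}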

10253

Is it worth making "100" a compiler option, so that it is not hardcoded? Could it reuse VectorizeMemoryCheckThreshold, even if it is a different unit?

fhahn updated this revision to Diff 432540.May 27 2022, 6:31 AM
fhahn marked an inline comment as done.

Address latest comments, thanks!

fhahn marked 2 inline comments as done.May 27 2022, 6:34 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1973

Would it be better to call CM.getInstructionCost here, or the base TTI.getInstructionCost?

Good point, as we are only interested in the scalar cost it should be sufficient to use TTI. Updated, thanks!

1980

I updated the code to print instruction costs here. An interesting result is that it seems our cost estimates for GEPs currently always assume they are used in instructions where the constant can be folded via the addressing mode. This is not true when using the pointers in compares as we do here. So the cost is slightly underestimated.

10253

Good point, I updated the code to re-use the option.

dmgreen accepted this revision.May 29 2022, 4:29 AM

Thanks. No further objections from me. LGTM

nikic added a comment.May 30 2022, 4:01 AM

The patch title and description could use an update. It currently sounds like this limits the amount of vectorization, while in practice this makes vectorization much more aggressive (right?)

Just to double check, now that an additional limit has been introduced, did the large code size regressions go away?

fhahn retitled this revision from [LV] Don't vectorize if we can prove RT + vector cost >= scalar cost. to [LV] Vectorize cases with larger number of RT checks, execute only if profitable..Jun 3 2022, 3:02 AM
fhahn edited the summary of this revision. (Show Details)
fhahn marked 2 inline comments as done.Jun 3 2022, 3:06 AM

The patch title and description could use an update. It currently sounds like this limits the amount of vectorization, while in practice this makes vectorization much more aggressive (right?)

Thanks, I just updated the title & description!

Just to double check, now that an additional limit has been introduced, did the large code size regressions go away?

Mostly yes. I updated the description with size numbers for AArch64 with -O3 (this should be the same configuration used for the earlier numbers). There are still a few increases, but much less severe ones than originally. Also, that's out of 237 binaries (SPEC2006, SPEC2017rate and MultiSource).

The patch should not be blocked by other patches, right? Or is there any other blocker?

fhahn updated this revision to Diff 436937.Jun 14 2022, 2:13 PM

The patch should not be blocked by other patches, right? Or is there any other blocker?

No, it should be all good to go now. I am planning on landing the earlier patch in the chain and then this one relatively soon.

fhahn updated this revision to Diff 442086.Jul 4 2022, 6:54 AM

Another rebase just before landing.

This revision was landed with ongoing or failed builds.Jul 4 2022, 7:12 AM
This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Jul 4 2022, 8:39 AM

Looks like this breaks check-clang: http://45.33.8.238/linux/80254/step_7.txt

fhahn added a comment.Jul 4 2022, 9:25 AM

Looks like this breaks check-clang: http://45.33.8.238/linux/80254/step_7.txt

Thanks, should be fixed by 9eb65727861166543

asmok-g added a subscriber: asmok-g.Jul 7 2022, 9:30 AM

Heads-up: I think this patch caused a mis-compile that's causing a test in TensorFlow to fail. We're still confirming it and working on a reproducer.

alexfh added a subscriber: alexfh.Jul 7 2022, 9:48 AM

Heads-up: I think this patch caused a mis-compile that's causing a test in TensorFlow to fail. We're still confirming it and working on a reproducer.

I got it down to this sample:

void *memmove(void * destination, const void * source, unsigned long num);
void f(char *s, char *d, int g) {
  while (g--) {
    memmove(d, s, 4);
    d += 4;
  }
}

When compiled with --target=x86_64--linux-gnu -O2, before and after this commit, the resulting assembly differs in a way that seems wrong to me:

@@ -1,100 +1,100 @@
        .text
        .file   "input.i"
        .globl  f                               # -- Begin function f
        .p2align        4, 0x90
        .type   f,@function
 f:                                      # @f
        .cfi_startproc
 # %bb.0:
                                         # kill: def $edx killed $edx def $rdx
        testl   %edx, %edx
        je      .LBB0_16
 # %bb.1:
        leal    -1(%rdx), %r8d
-       cmpl    $7, %r8d
+       cmpl    $15, %r8d
        jb      .LBB0_2
 # %bb.3:
        leaq    4(%rdi), %rax
        cmpq    %rsi, %rax
        jbe     .LBB0_6
 # %bb.4:
        leaq    (%rsi,%r8,4), %rax
        addq    $4, %rax
        cmpq    %rdi, %rax
        jbe     .LBB0_6
 .LBB0_2:
        movq    %rsi, %rax
 .LBB0_9:
        leal    -1(%rdx), %r8d
        testb   $7, %dl
        je      .LBB0_13
 # %bb.10:
        movl    %edx, %r9d
        andl    $7, %r9d
        xorl    %esi, %esi
        .p2align        4, 0x90
 .LBB0_11:                               # =>This Inner Loop Header: Depth=1
        movl    (%rdi), %ecx
        movl    %ecx, (%rax)
        addq    $4, %rax
        incq    %rsi
        cmpl    %esi, %r9d
        jne     .LBB0_11
 # %bb.12:
        subl    %esi, %edx
 .LBB0_13:
        cmpl    $7, %r8d
        jb      .LBB0_16
 # %bb.14:
        movl    %edx, %ecx
        xorl    %edx, %edx
        .p2align        4, 0x90
 .LBB0_15:                               # =>This Inner Loop Header: Depth=1
        movl    (%rdi), %esi
        movl    %esi, (%rax,%rdx,4)
        movl    (%rdi), %esi
        movl    %esi, 4(%rax,%rdx,4)
        movl    (%rdi), %esi
        movl    %esi, 8(%rax,%rdx,4)
        movl    (%rdi), %esi
        movl    %esi, 12(%rax,%rdx,4)
        movl    (%rdi), %esi
        movl    %esi, 16(%rax,%rdx,4)
        movl    (%rdi), %esi
        movl    %esi, 20(%rax,%rdx,4)
        movl    (%rdi), %esi
        movl    %esi, 24(%rax,%rdx,4)
        movl    (%rdi), %esi
        movl    %esi, 28(%rax,%rdx,4)
        addq    $8, %rdx
        cmpl    %edx, %ecx
        jne     .LBB0_15
        jmp     .LBB0_16
 .LBB0_6:
        incq    %r8
        movq    %r8, %r9
        andq    $-8, %r9
        subl    %r9d, %edx
        leaq    (%rsi,%r9,4), %rax
        movd    (%rdi), %xmm0                   # xmm0 = mem[0],zero,zero,zero
        pshufd  $0, %xmm0, %xmm0                # xmm0 = xmm0[0,0,0,0]
        xorl    %ecx, %ecx
        .p2align        4, 0x90
 .LBB0_7:                                # =>This Inner Loop Header: Depth=1
        movdqu  %xmm0, (%rsi,%rcx,4)
        movdqu  %xmm0, 16(%rsi,%rcx,4)
        addq    $8, %rcx
        cmpq    %rcx, %r9
        jne     .LBB0_7
 # %bb.8:
        cmpq    %r9, %r8
        jne     .LBB0_9
 .LBB0_16:
        retq
 .Lfunc_end0:
        .size   f, .Lfunc_end0-f
        .cfi_endproc
                                         # -- End function
-       .ident  "clang version google3-trunk (aa78c5298ea37f2ca8150dc0a1c880be7ec438f4)"
+       .ident  "clang version google3-trunk (644a965c1efef68f22d9495e4cefbb599c214788)"
        .section        ".note.GNU-stack","",@progbits
        .addrsig
alexfh added a comment.Jul 7 2022, 9:55 AM

When compiled with --target=x86_64--linux-gnu -O2, before and after this commit, the resulting assembly differs in a way that seems wrong to me:

After reading the description of the commit I'm not sure about this being wrong, but this sort of change has definitely caused a difference in the behavior of some numpy C code used by the TensorFlow test @asmok-g mentioned.

fhahn added a comment.Jul 7 2022, 2:16 PM

When compiled with --target=x86_64--linux-gnu -O2, before and after this commit, the resulting assembly differs in a way that seems wrong to me:

After reading the description of the commit I'm not sure about this being wrong, but this sort of change has definitely caused a difference in the behavior of some numpy C code used by the TensorFlow test @asmok-g mentioned.

I had a look at the example, but I don't think the patch is at fault here directly. The only difference for the example is that the vector loop is only executed if the loop executes 16 or more iterations vs 8 or more before. This shouldn't impact correctness, unless the code path for the scalar loop is mis-compiled.

Here's the only IR change:

<   %min.iters.check = icmp ult i32 %0, 7
---
>   %min.iters.check = icmp ult i32 %0, 15

Is it possible that the reproducer has been reduced too far? Does it work as expected if vectorization is disabled for the loop via #pragma clang loop vectorize(disable) / #pragma clang loop interleave(disable)?

Is it possible that the reproducer has been reduced too far?

Yes, this was the case. And when I inspected the original code closer, I found a problem with the C code and a problem with the test using it. Thus, false alarm here.

alexfh added a comment.Jul 8 2022, 3:09 AM

By the way, this commit seems to regress the llvm_singlesource / Misc_oourafft benchmark by ~8% on multiple configurations. I'm not sure what the right tradeoff here is and how representative the benchmark is of real-life workloads, but you might want to look at this.

alexfh added a comment.Jul 8 2022, 3:30 AM

By "multiple configurations" I meant multiple microarchitectures and optimization modes (-O3 and FDO).

Hi @fhahn, I believe this patch breaks the AArch64/SVE buildbots. You can see the effect on https://lab.llvm.org/buildbot/#/builders/197/builds/2185. It looks like we were unlucky with the buildbot run that actually contained your patch, because it failed for what looks like a temporary CI issue, but all the runs after this point show functional regressions when running LNT. I've manually checked your commit and the one before yours to confirm this is when things start to fail.

fhahn added a comment.Jul 14 2022, 4:51 PM

Hi @fhahn, I believe this patch breaks the AArch64/SVE buildbots. You can see the effect on https://lab.llvm.org/buildbot/#/builders/197/builds/2185. It looks like we were unlucky with the buildbot run that actually contained your patch, because it failed for what looks like a temporary CI issue, but all the runs after this point show functional regressions when running LNT. I've manually checked your commit and the one before yours to confirm this is when things start to fail.

Thanks for the heads-up! Would it be possible to share a preprocessed file that gets miscompiled or an LLVM IR function that shows the issue? Otherwise I won't be able to reproduce/investigate the failure, as I do not have access to SVE hardware.

The issue is easily reproduced via user mode qemu (in my case I did this on x86) if that helps? Otherwise I can see about getting a reproducer but that'll take some time so can you revert the patch in the interim?

I think I know what the issue is. I'll prepare a fix on Friday; if it takes longer I'll revert the change.

fhahn added a comment.Jul 16 2022, 9:21 AM

Should be fixed by aa00fb02c98a. The bot is back to green.