This is an archive of the discontinued LLVM Phabricator instance.

Differential D75981

[LV] Create RT checks once VF/IC are selected, track scalar cost.
ClosedPublic

Authored by fhahn on Mar 11 2020, 4:05 AM.

Details

Reviewers

rengolin
Ayal
gilr
hsaito
anemet
lebedev.ri
dmgreen

Commits

rGcb69ba4faaf1: [LV] Create RT checks once VF/IC are selected, track scalar cost.

Summary

This patch updates LV to generate runtime checks after the VF & IC are selected. It
allows deciding whether to vectorize with runtime checks or not based on
their cost compared to the vector loop.

It also updates VectorizationFactor to include the scalar cost.
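
As a rough illustration of what "include the scalar cost" can mean, here is a minimal, self-contained sketch. It is not the code from the patch: `VFCandidate` and `selectFactor` are hypothetical names, and plain integers stand in for LLVM's cost types. The point is only that the chosen factor carries the scalar loop cost with it, so a later step (such as costing the runtime checks) can compare against it without recomputing.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a vectorization factor that also records the
// cost of the scalar loop. All costs are abstract units, not LLVM types.
struct VFCandidate {
  unsigned Width;      // number of lanes (1 means the scalar loop)
  uint64_t Cost;       // cost of one loop iteration at this width
  uint64_t ScalarCost; // cost of one iteration of the scalar loop
};

// Pick the candidate with the lowest cost per original loop iteration.
// Cost/Width is compared by cross-multiplication to avoid integer division.
inline VFCandidate selectFactor(const std::vector<VFCandidate> &Candidates) {
  assert(!Candidates.empty() && "need at least the scalar candidate");
  VFCandidate Best = Candidates.front();
  for (const VFCandidate &C : Candidates)
    if (C.Cost * Best.Width < Best.Cost * C.Width)
      Best = C;
  return Best;
}
```

With candidates of width 1, 4 and 8 costing 10, 16 and 40 units per iteration, the per-lane costs are 10, 4 and 5, so width 4 is selected, and the returned value still exposes the scalar cost of 10 for later comparisons.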

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fhahn created this revision. Mar 11 2020, 4:05 AM

Herald added a project: Restricted Project. Mar 11 2020, 4:05 AM

Herald added subscribers: rkruppe, hiraditya.

fhahn mentioned this in D71053: [LV] Take overhead of run-time checks into account during vectorization.. Mar 11 2020, 4:12 AM

fhahn added a parent revision: D75980: [LV] Generate RT checks up-front and remove them if required.. Mar 11 2020, 4:35 AM

fhahn edited the summary of this revision. Mar 11 2020, 4:42 AM

Harbormaster failed remote builds in B48794: Diff 249581! Mar 11 2020, 5:44 AM

lebedev.ri added inline comments. Mar 11 2020, 3:19 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9308	Please make 0.005 an option

Reverse-ping, thanks.
Anything to get this going?

lebedev.ri added a reviewer: lebedev.ri. Jul 21 2020, 12:03 PM

rebase.

With the linked dependent patches, this should now successfully build test-suite with MultiSource/SPEC2000/SPEC2006.

This leads to additional vectorization with runtime checks in a few more cases:

Same hash: 223 (filtered out)
Remaining: 14
Metric: loop-vectorize.LoopsVectorized

Program                                        patch1 patch2 diff
 test-suite...Source/Benchmarks/sim/sim.test     5.00   8.00 60.0%
 test-suite...rks/FreeBench/pifft/pifft.test    33.00  47.00 42.4%
 test-suite...chmarks/Rodinia/srad/srad.test     3.00   4.00 33.3%
 test-suite...CFP2000/177.mesa/177.mesa.test   379.00 417.00 10.0%
 test-suite...CI_Purple/SMG2000/smg2000.test    78.00  84.00  7.7%
 test-suite...pps-C/SimpleMOC/SimpleMOC.test    39.00  42.00  7.7%
 test-suite...oxyApps-C/miniGMG/miniGMG.test    42.00  44.00  4.8%
 test-suite.../CINT2000/176.gcc/176.gcc.test    97.00 100.00  3.1%
 test-suite...006/450.soplex/450.soplex.test    88.00  90.00  2.3%
 test-suite...lications/ClamAV/clamscan.test    91.00  93.00  2.2%
 test-suite...pplications/oggenc/oggenc.test   130.00 132.00  1.5%
 test-suite...006/447.dealII/447.dealII.test   958.00 970.00  1.3%

Harbormaster completed remote builds in B66944: Diff 282932. Aug 4 2020, 8:44 AM

xbolva00 added a subscriber: xbolva00. Aug 15 2020, 3:59 PM

Rebased on top of current trunk. This version now can build MultiSource/SPEC2006/SPEC2000 with -O3 -flto without crashing.

The current version leads to a few more vectorized loops in some benchmarks:

Tests: 236
Same hash: 200 (filtered out)
Remaining: 36
Metric: loop-vectorize.LoopsVectorized

Program                                        base   patch.lv-mem-cost diff
 test-suite...Source/Benchmarks/sim/sim.test     5.00   8.00            60.0%
 test-suite...chmarks/Rodinia/srad/srad.test     3.00   4.00            33.3%
 test-suite...rks/FreeBench/pifft/pifft.test    33.00  43.00            30.3%
 test-suite...CFP2000/177.mesa/177.mesa.test   386.00 424.00             9.8%
 test-suite...CI_Purple/SMG2000/smg2000.test    77.00  83.00             7.8%
 test-suite...pps-C/SimpleMOC/SimpleMOC.test    39.00  42.00             7.7%
 test-suite...oxyApps-C/miniGMG/miniGMG.test    44.00  46.00             4.5%
 test-suite.../CINT2000/176.gcc/176.gcc.test    99.00 102.00             3.0%
 test-suite...006/450.soplex/450.soplex.test    88.00  90.00             2.3%
 test-suite...lications/ClamAV/clamscan.test    97.00  99.00             2.1%
 test-suite...pplications/oggenc/oggenc.test   151.00 153.00             1.3%
 test-suite...006/447.dealII/447.dealII.test   970.00 982.00             1.2%

Harbormaster completed remote builds in B83877: Diff 314337. Jan 4 2021, 2:13 AM

rkruppe removed a subscriber: rkruppe. Jan 4 2021, 2:53 AM

rebase on top of the recent changes.

Harbormaster completed remote builds in B84454: Diff 315353. Jan 8 2021, 5:01 AM

rebase

Harbormaster completed remote builds in B89374: Diff 323978. Feb 16 2021, 6:01 AM

Rebased after recent changes to D75980

Harbormaster completed remote builds in B89928: Diff 324996. Feb 19 2021, 8:40 AM

fhahn mentioned this in rG0cb9d8acbccb: [LV] Add test cases that require a larger number of RT checks.. Mar 2 2021, 2:50 AM

Ping.

All dependent patches have been submitted now. I also pre-committed a simplified version of the test cases in 0cb9d8acbccb.

The update also adds an option to control the threshold.

Harbormaster completed remote builds in B91538: Diff 327404. Mar 2 2021, 3:33 AM

xbolva00 added inline comments. Mar 2 2021, 3:42 AM

llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-size-based-threshold.ll
9	is too small?

Fix wording in test comments, thanks!

Harbormaster completed remote builds in B91541: Diff 327407. Mar 2 2021, 3:49 AM

You'd better remove the (WIP) suffix if this patch is ready for review. Otherwise people may think it is still in progress...

lebedev.ri added inline comments. Mar 4 2021, 3:51 AM

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
269–270	Hm, why does `PragmaThresholdReached` not check for `Hints.allowReordering()`? If i really asked for loop to be vectorized, why are there further limits on the sanity checks? Or, for that matter, doesn't forcing vectorization disable those checks in the first place?

I'd like to understand why you chose to go the way it is instead of taking cost of runtime checks into account in the cost model itself? To me, the cost model is the right place for that. I would expect to see something similar to what I did in https://reviews.llvm.org/D71053 for LoopVectorizationPlanner::mayDisregardRTChecksOverhead

In D75981#2605728, @ebrevnov wrote:

I'd like to understand why you chose to go the way it is instead of taking cost of runtime checks into account in the cost model itself? To me, the cost model is the right place for that. I would expect to see something similar to what I did in https://reviews.llvm.org/D71053 for LoopVectorizationPlanner::mayDisregardRTChecksOverhead

That's a good point. I initially tried to keep things as closely modeled to the original code. I think the way the number of runtime checks is handled in LoopVectorizationRequirements is not ideal and makes things more difficult to follow. I am also not sure why those checks are handled separately (in doesNotMeet). I tried to see if we can remove doesNotMeet and instead move the checks at an earlier and more appropriate place.

I put up D98634 and D98633 to remove doesNotMeet, which moved the decision whether to vectorize with RT checks to LVP::plan(). I also updated this patch to make the cost-based decision in LVP::plan. From there it should be easy to adjust it further to use it for more cost-based decisions, as in your patch. Is this more in line with what you had in mind?

Herald added a subscriber: bmahjour. Mar 15 2021, 8:48 AM

Harbormaster completed remote builds in B93828: Diff 330679. Mar 15 2021, 8:49 AM

fhahn added inline comments. Mar 15 2021, 8:54 AM

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
269–270	I can't really comment much on why the existing code does what it does. But it is indeed surprising that runtime checks can block vectorization if a width is explicitly set. I put up D98634 and D98633 to move the `doesNotMeet` restrictions to what seems more appropriate places to me. The updated code also skips the checks if an explicit VF is forced by the user. I updated the code in this patch to always use the cost-based check if a constant TC is available here.

fhahn retitled this revision from [LV] Allow large RT checks, if they are a fraction of the scalar cost (WIP) to [LV] Allow large RT checks, if they are a fraction of the scalar cost.. Mar 29 2021, 9:31 AM

Rebase & ping.

All dependent patches have been submitted and the decision is now made during planning.

Harbormaster completed remote builds in B96141: Diff 333908. Mar 29 2021, 9:32 AM

LGTM.
As with other patches, i'm not the best reviewer for this, but clearly others are otherwise preoccupied.

This isn't the final form, but i think this part is a reasonably safe step.
Things can and will be adjusted further later.

This revision is now accepted and ready to land. Mar 29 2021, 9:39 AM

I don't really understand why we need this separate heuristic for runtime checks. Why don't we simply add cost of runtime checks (possibly with some small scaling to be safe) to total cost of vector loop and just use existing cost model to decide?

Please add regression tests from https://reviews.llvm.org/D71053 to this change set as well.

Reverse ping, thanks.

@fhahn Reverse ping, thanks :)

reverse-ping, thanks.

Futile weekly reverse-ping, thanks (:

Reverse ping, thanks.

I finally had some time to rebase this change and fix the fallout.

In D75981#2662881, @ebrevnov wrote:

I don't really understand why we need this separate heuristic for runtime checks. Why don't we simply add cost of runtime checks (possibly with some small scaling to be safe) to total cost of vector loop and just use existing cost model to decide?

That's a good point, but I think that would be better as separate change, because that's a more aggressive change than replacing existing limit. IIUC that's more in line with your D71053.

Please add regression tests from https://reviews.llvm.org/D71053 to this change set as well.

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

Harbormaster completed remote builds in B111038: Diff 354555. Jun 25 2021, 11:00 AM

Bump :)

In D75981#2841271, @fhahn wrote:

I finally had some time to rebase this change and fix the fallout.

Hurray!
It would be good to finally have this resolved.

reverse ping, thanks

Sorry for long silence. Got into hospital with COVID-19 for almost a month.

In D75981#2662881, @ebrevnov wrote:

I don't really understand why we need this separate heuristic for runtime checks. Why don't we simply add cost of runtime checks (possibly with some small scaling to be safe) to total cost of vector loop and just use existing cost model to decide?

That's a good point, but I think that would be better as separate change, because that's a more aggressive change than replacing existing limit. IIUC that's more in line with your D71053.

You are right, I'm essentially asking to follow D71053. First of all, for the sake of making progress I'm not going to block this change if you promise to continue working on the cost-model-driven approach.
But I personally think that it would save a lot of time if we go with the cost-model-based approach in the first place, because the most time-consuming thing would be fixing performance regressions and not the implementation itself. I will leave it to you to decide :-).

Please add regression tests from https://reviews.llvm.org/D71053 to this change set as well.

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

I believe both test cases have vectorization with runtime checks. Look for "; CHECK: vector.memcheck:"

It is *really* sad to see these two patches to be stuck :(

I think the goals of these two patches are largely correlated.

I think the main problem is that it isn't quite obvious why hard cut-offs on the runtime check complexity exist.
I guess, to not generate some very large and ridiculous checks.
But clearly, the current cut-offs are just bogusly low.
But i also guess simply bumping them won't really solve the problem,
so i guess we need to redefine them. But what is the right metric,
especially if the trip count is not constant?
Cost of a single scalar loop iteration?

Reverse ping, thanks.

I think the main problem is that it isn't quite obvious why hard cut-offs on the runtime check complexity exist.
I guess, to not generate some very large and ridiculous checks.

I believe those cut-offs exist for a single reason: there was no way to calculate the "real" cost of SCEV-generated instructions. Now there is support for that and we can/should simply take the cost of runtime checks into account in the cost model.

But clearly, the current cut-offs are just bogusly low.
But i also guess simply bumping them won't really solve the problem,
so i guess we need to redefine them. But what is the right metric,
especially if the trip count is not constant?
Cost of a single scalar loop iteration?

Maybe not the best approach, but taking some reasonable average across applications works well enough. Another metric to consider is the benefit from vectorization. Thus loops with an expected 3x improvement should be more likely to be vectorized than those with 1.1x.
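
To make the "benefit from vectorization" metric concrete, here is a small sketch of a break-even calculation. It is an illustration under assumed, abstract cost units and is not taken from either patch; `breakEvenTripCount` and its parameters are hypothetical names. It computes the smallest trip count (as a multiple of VF) at which the one-off cost of the runtime checks is paid back by the cheaper vector iterations.

```cpp
#include <cassert>
#include <cstdint>

// Smallest trip count (a multiple of VF) at which the up-front CheckCost is
// recovered by the vector loop. Returns 0 when the vector loop never saves
// anything. One vector iteration replaces VF scalar iterations.
inline uint64_t breakEvenTripCount(uint64_t CheckCost, uint64_t ScalarIterCost,
                                   uint64_t VectorIterCost, unsigned VF) {
  uint64_t ScalarPerBlock = ScalarIterCost * VF; // cost of VF scalar iterations
  if (VectorIterCost >= ScalarPerBlock)
    return 0; // no per-block saving, so the checks can never pay off
  uint64_t SavingPerBlock = ScalarPerBlock - VectorIterCost;
  // Need Blocks * SavingPerBlock > CheckCost.
  uint64_t Blocks = CheckCost / SavingPerBlock + 1;
  return Blocks * VF;
}
```

With VF = 4, a scalar iteration cost of 10 and a check cost of 100, a vector iteration costing 13 (roughly a 3x speedup) breaks even at a trip count of 16, while one costing 36 (roughly 1.1x) needs 104 iterations. Under these assumptions the same runtime checks are far easier to justify for the loop with the larger expected speedup.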

Honestly in the end i'm not sure i know which approach is best, i just want to see this finally fixed :/

Rebased again after recent changes.

That's a good point, but I think that would be better as separate change, because that's a more aggressive change than replacing existing limit. IIUC that's more in line with your D71053.

You are right, I'm essentially asking to follow D71053. First of all, for the sake of making progress I'm not going to block this change if you promise to continue working on the cost-model-driven approach.
But I personally think that it would save a lot of time if we go with the cost-model-based approach in the first place, because the most time-consuming thing would be fixing performance regressions and not the implementation itself. I will leave it to you to decide :-).

The way I see it the current patch is already using a cost-model based approach: it already computes the cost of the runtime checks and the cost of the scalar loop and compares them.

The formula used in the patch initially is conservative I think, in that it allows larger runtime checks only if their failing adds only a small overhead to the total cost of the scalar loop.

Of course we can choose other formulas, e.g. computing the cost of all vector iterations + RT checks and compare it against the cost of all scalar iterations. This is more optimistic as it assumes the runtime checks succeed.

The main reason I went for the conservative approach initially was because we are already seeing regressions in benchmarks caused by runtime checks for vector loops never taken. I don't want to make this problem worse for now.

Personally I'd prefer to start with a more conservative heuristic, see how it goes & iron out issues in the infrastructure.

How does that sound?
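
The two formulas contrasted in this comment can be sketched as follows. This is a hedged illustration, not the code in the patch: the function names are hypothetical, costs are abstract units, and treating the 0.005 constant from the earlier inline comment as the allowed fraction of the scalar cost is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Conservative rule (assumed form): allow the checks only if, should they
// fail, they add at most Fraction of the total scalar loop cost.
inline bool allowChecksConservative(uint64_t CheckCost, uint64_t ScalarIterCost,
                                    uint64_t TripCount,
                                    double Fraction = 0.005) {
  double ScalarTotal =
      static_cast<double>(ScalarIterCost) * static_cast<double>(TripCount);
  return static_cast<double>(CheckCost) <= Fraction * ScalarTotal;
}

// Optimistic rule: assume the checks succeed and compare the checks plus all
// vector iterations (and a scalar remainder) against all scalar iterations.
inline bool allowChecksOptimistic(uint64_t CheckCost, uint64_t VectorIterCost,
                                  unsigned VF, uint64_t ScalarIterCost,
                                  uint64_t TripCount) {
  uint64_t VectorIters = TripCount / VF;
  uint64_t Remainder = TripCount % VF;
  uint64_t VectorTotal =
      CheckCost + VectorIters * VectorIterCost + Remainder * ScalarIterCost;
  return VectorTotal < ScalarIterCost * TripCount;
}
```

For a check cost of 40, a scalar iteration cost of 10 and a trip count of 100, the conservative rule rejects the checks (40 is well above 0.5% of 1000), while the optimistic rule with VF = 4 and a vector iteration cost of 16 accepts them (440 versus 1000). Both rules need a known trip count, which matches the discussion's focus on the constant trip count case first.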

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

I believe both test cases have vectorization with runtime checks. Look for "; CHECK: vector.memcheck:"

Ah I see, thanks! I think I meant that the patch seems to lift a different but related limitation (tiny trip count vectorization with epilogue) vs this patch which deals with the runtime check threshold.

But clearly, the current cut-offs are just bogusly low.
But i also guess simply bumping them won't really solve the problem,
so i guess we need to redefine them. But what is the right metric,
especially if the trip count is not constant?
Cost of a single scalar loop iteration?

Maybe not the best approach, but taking some reasonable average across applications works well enough. Another metric to consider is the benefit from vectorization. Thus loops with an expected 3x improvement should be more likely to be vectorized than those with 1.1x.

Agreed, for cases where the trip count is unknown we will have to still make an educated guess. It should still be better/more informed than the single number cut-off for the number of runtime checks. But as I said, I think we should start with the cases where the trip count is known, make sure it works well for that case and move on from there. This also gives us time to iron out any issues with the infrastructure.

Harbormaster completed remote builds in B119790: Diff 366739. Aug 16 2021, 2:12 PM

Sure, slow (but steady!) forward progress is better than being stuck with subpar status-quo.
I don't really have anything against the current patch as-is,
with big fat note that there are further follow-up changes needed:

drop still-present hard cut-off on the number of the checks
support variable trip count
???

I'm fine to go with a more conservative heuristic. The change I would like to see is to move the runtime checks cost calculation inside the cost model. This way CM.selectVectorizationFactor would return a VF with the cost of runtime checks already taken into account. It would be a bit inconvenient to "merge" with the existing limits for runtime checks though. What if we just effectively disable the current limits by putting them under an option for now and delete them entirely eventually?

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

I believe both test cases have vectorization with runtime checks. Look for "; CHECK: vector.memcheck:"

Ah I see, thanks! I think I meant that the patch seems to lift a different but related limitation (tiny trip count vectorization with epilogue) vs this patch which deals with the runtime check threshold.

I think I understand now. Indeed, for the tests to make sense we would need to allow vectorization of short trip count loops with runtime checks. I withdraw the question.

But clearly, the current cut-offs are just bogusly low.
But i also guess simply bumping them won't really solve the problem,
so i guess we need to redefine them. But what is the right metric,
especially if the trip count is not constant?
Cost of a single scalar loop iteration?

Maybe not the best approach, but taking some reasonable average across applications works well enough. Another metric to consider is the benefit from vectorization. Thus loops with an expected 3x improvement should be more likely to be vectorized than those with 1.1x.

Agreed, for cases where the trip count is unknown we will have to still make an educated guess. It should still be better/more informed than the single number cut-off for the number of runtime checks. But as I said, I think we should start with the cases where the trip count is known, make sure it works well for that case and move on from there. This also gives us time to iron out any issues with the infrastructure.

Agree. This is unrelated to the current patch.

reverse-ping, thanks

bmahjour removed a subscriber: bmahjour. Aug 27 2021, 7:23 AM

lebedev.ri mentioned this in D109296: [LV] Improve inclusivity of vectorization. Sep 5 2021, 12:13 PM

Unfortunately I do not have the bandwidth to get this unstuck at the moment. So I'm going to strip off the heuristics change and just keep the changes that set up the last bits of the infrastructure.

Also updates getCost to use InstructionCost instead of unsigned.

Harbormaster completed remote builds in B122881: Diff 371089. Sep 7 2021, 8:32 AM

fhahn retitled this revision from [LV] Allow large RT checks, if they are a fraction of the scalar cost. to [LV] Create RT checks during planning, expose cost functions.. Sep 7 2021, 8:34 AM

fhahn edited the summary of this revision.

In D75981#2986956, @fhahn wrote:

Unfortunately I do not have the bandwidth to get this unstuck at the moment. So I'm going to strip off the heuristics change and just keep the changes that set up the last bits of the infrastructure.

Sad to hear that. I can pick it up and try to push forward if you don't mind :-) But I'll wait while D109368 is landed since there is fair amount of dependencies. Sounds good?

In D75981#2988560, @ebrevnov wrote:

In D75981#2986956, @fhahn wrote:

Unfortunately I do not have the bandwidth to get this unstuck at the moment. So I'm going to strip off the heuristics change and just keep the changes that set up the last bits of the infrastructure.

Sad to hear that. I can pick it up and try to push forward if you don't mind :-) But I'll wait while D109368 is landed since there is fair amount of dependencies. Sounds good?

The thing to keep in mind is, after D109296 (and i assume/hope it succeeds) adjusts the RT check budget allowance for variable loops,
i suspect most of the loops with constant trip count will also start getting vectorized.

ebrevnov mentioned this in D109368: [LV] Vectorize cases with larger number of RT checks, execute only if profitable.. Sep 8 2021, 10:20 AM

fhahn added a child revision: D109368: [LV] Vectorize cases with larger number of RT checks, execute only if profitable.. Sep 10 2021, 6:06 AM

ebrevnov mentioned this in D109443: [LV] Lazy creation of runtime checks. Sep 26 2021, 11:49 PM

ebrevnov added inline comments. Oct 1 2021, 12:14 AM

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
193 ↗	(On Diff #371089)	Scalar cost does not depend on VF, so this doesn't look like the best place for it. Please take a look at https://reviews.llvm.org/D109678 where I propose to cache scalar cost inside the cost model. How do you like it? Will it work for you as well?

This needs a rebase.

rebased

Harbormaster completed remote builds in B129407: Diff 380489. Oct 18 2021, 11:46 AM

rebase

Harbormaster completed remote builds in B131642: Diff 383669. Oct 31 2021, 12:14 PM

Another rebase, fixing an incorrect conflict resolution earlier. Tests should pass again.

Harbormaster completed remote builds in B133018: Diff 385497. Nov 8 2021, 7:39 AM

another rebase :)

Harbormaster completed remote builds in B135272: Diff 388712. Nov 20 2021, 10:17 AM

rebase

Harbormaster completed remote builds in B141341: Diff 397081. Jan 3 2022, 9:20 AM

Rebase

Herald added a project: Restricted Project. Apr 5 2022, 9:42 AM

Harbormaster completed remote builds in B158019: Diff 420559. Apr 5 2022, 9:43 AM

Rebase after recent changes.

Harbormaster completed remote builds in B166237: Diff 431937. May 25 2022, 4:13 AM

Do not expose getInstructionCost, as the latest version of D109368 can simply use TTI.

Harbormaster completed remote builds in B166630: Diff 432533. May 27 2022, 5:41 AM

dmgreen accepted this revision. May 29 2022, 4:38 AM

strip unneeded changes, I plan to land this soon, after updating the description to reflect the state in the latest version

Harbormaster completed remote builds in B169815: Diff 436908. Jun 14 2022, 1:33 PM

fhahn retitled this revision from [LV] Create RT checks during planning, expose cost functions. to [LV] Create RT checks once VF/IC are selected, track scalar cost.. Jun 20 2022, 5:31 AM

fhahn edited the summary of this revision.

This revision was landed with ongoing or failed builds. Jun 24 2022, 8:42 AM

Closed by commit rGcb69ba4faaf1: [LV] Create RT checks once VF/IC are selected, track scalar cost. (authored by fhahn).

This revision was automatically updated to reflect the committed changes.

fhahn added a commit: rGcb69ba4faaf1: [LV] Create RT checks once VF/IC are selected, track scalar cost..

Revision Contents

Path	Size
llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h	3 lines
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp	13 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	58 lines
llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-size-based-threshold.ll	156 lines

Diff 315353

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h

Show First 20 Lines • Show All 181 Lines • ▼ Show 20 Lines	public:
void addUnsafeAlgebraInst(Instruction *I) {		void addUnsafeAlgebraInst(Instruction *I) {
// First unsafe algebra instruction.		// First unsafe algebra instruction.
if (!UnsafeAlgebraInst)		if (!UnsafeAlgebraInst)
UnsafeAlgebraInst = I;		UnsafeAlgebraInst = I;
}		}

void addRuntimePointerChecks(unsigned Num) { NumRuntimePointerChecks = Num; }		void addRuntimePointerChecks(unsigned Num) { NumRuntimePointerChecks = Num; }

bool doesNotMeet(Function *F, Loop *L, const LoopVectorizeHints &Hints);		bool doesNotMeet(Function *F, Loop *L, const LoopVectorizeHints &Hints,
		bool CanIgnoreRTThreshold);

private:		private:
unsigned NumRuntimePointerChecks = 0;		unsigned NumRuntimePointerChecks = 0;
Instruction *UnsafeAlgebraInst = nullptr;		Instruction *UnsafeAlgebraInst = nullptr;

/// Interface to emit optimization remarks.		/// Interface to emit optimization remarks.
OptimizationRemarkEmitter &ORE;		OptimizationRemarkEmitter &ORE;
};		};

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

Show First 20 Lines • Show All 240 Lines • ▼ Show 20 Lines	if (Name == H->Name) {
H->Value = Val;		H->Value = Val;
else		else
LLVM_DEBUG(dbgs() << "LV: ignoring invalid hint '" << Name << "'\n");		LLVM_DEBUG(dbgs() << "LV: ignoring invalid hint '" << Name << "'\n");
break;		break;
}		}
}		}
}		}

bool LoopVectorizationRequirements::doesNotMeet(		bool LoopVectorizationRequirements::doesNotMeet(Function *F, Loop *L,
Function *F, Loop *L, const LoopVectorizeHints &Hints) {		const LoopVectorizeHints &Hints,
		bool IgnoreRTThreshold) {
const char *PassName = Hints.vectorizeAnalysisPassName();		const char *PassName = Hints.vectorizeAnalysisPassName();
bool Failed = false;		bool Failed = false;
if (UnsafeAlgebraInst && !Hints.allowReordering()) {		if (UnsafeAlgebraInst && !Hints.allowReordering()) {
ORE.emit([&]() {		ORE.emit([&]() {
return OptimizationRemarkAnalysisFPCommute(		return OptimizationRemarkAnalysisFPCommute(
PassName, "CantReorderFPOps", UnsafeAlgebraInst->getDebugLoc(),		PassName, "CantReorderFPOps", UnsafeAlgebraInst->getDebugLoc(),
UnsafeAlgebraInst->getParent())		UnsafeAlgebraInst->getParent())
<< "loop not vectorized: cannot prove it is safe to reorder "		<< "loop not vectorized: cannot prove it is safe to reorder "
"floating-point operations";		"floating-point operations";
});		});
Failed = true;		Failed = true;
}		}

// Test if runtime memcheck thresholds are exceeded.		// Test if runtime memcheck thresholds are exceeded.
bool PragmaThresholdReached =		bool PragmaThresholdReached =
NumRuntimePointerChecks > PragmaVectorizeMemoryCheckThreshold;		NumRuntimePointerChecks > PragmaVectorizeMemoryCheckThreshold;
bool ThresholdReached =		bool ThresholdReached =
NumRuntimePointerChecks > VectorizerParams::RuntimeMemoryCheckThreshold;		NumRuntimePointerChecks > VectorizerParams::RuntimeMemoryCheckThreshold;
if ((ThresholdReached && !Hints.allowReordering()) ||		bool DoubleThresholdReached =
PragmaThresholdReached) {		NumRuntimePointerChecks >
		2 * VectorizerParams::RuntimeMemoryCheckThreshold;
		if ((!IgnoreRTThreshold && ((ThresholdReached && !Hints.allowReordering()) ||
		PragmaThresholdReached)) ||
		(DoubleThresholdReached && !Hints.allowReordering())) {
lebedev.ri (inline comment, Not Done): Hm, why does `PragmaThresholdReached` not check for `Hints.allowReordering()`? If I really asked for the loop to be vectorized, why are there further limits on the sanity checks? Or, for that matter, doesn't forcing vectorization disable those checks in the first place?
fhahn (author, inline comment, Done): I can't really comment much on why the existing code does what it does. But it is indeed surprising that runtime checks can block vectorization if a width is explicitly set. I put up D98634 and D98633 to move the `doesNotMeet` restrictions to what seem to me more appropriate places. The updated code also skips the checks if an explicit VF is forced by the user. I updated the code in this patch to always use the cost-based check if a constant TC is available here.
ORE.emit([&]() {		ORE.emit([&]() {
return OptimizationRemarkAnalysisAliasing(PassName, "CantReorderMemOps",		return OptimizationRemarkAnalysisAliasing(PassName, "CantReorderMemOps",
L->getStartLoc(),		L->getStartLoc(),
L->getHeader())		L->getHeader())
<< "loop not vectorized: cannot prove it is safe to reorder "		<< "loop not vectorized: cannot prove it is safe to reorder "
"memory operations";		"memory operations";
});		});
LLVM_DEBUG(dbgs() << "LV: Too many memory checks needed.\n");		LLVM_DEBUG(dbgs() << "LV: Too many memory checks needed.\n");
... 1,029 lines not shown ...
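
The updated condition in `doesNotMeet` is easier to follow in isolation. The following is a minimal sketch, not the patch itself: `MemCheckState`, `rejectsForMemChecks` and `demoMemCheckThresholds` are hypothetical names, and the threshold values 8 and 128 are made up for illustration. It only mirrors the boolean logic shown in the diff above.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the state doesNotMeet() reads.
struct MemCheckState {
  unsigned NumChecks;       // stands in for NumRuntimePointerChecks
  unsigned Threshold;       // stands in for VectorizerParams::RuntimeMemoryCheckThreshold
  unsigned PragmaThreshold; // stands in for PragmaVectorizeMemoryCheckThreshold
  bool AllowReordering;     // stands in for Hints.allowReordering()
};

// Returns true if the loop would be rejected for needing too many
// runtime memory checks, following the condition in the new code.
bool rejectsForMemChecks(const MemCheckState &S, bool IgnoreRTThreshold) {
  bool PragmaThresholdReached = S.NumChecks > S.PragmaThreshold;
  bool ThresholdReached = S.NumChecks > S.Threshold;
  bool DoubleThresholdReached = S.NumChecks > 2 * S.Threshold;
  return (!IgnoreRTThreshold &&
          ((ThresholdReached && !S.AllowReordering) ||
           PragmaThresholdReached)) ||
         (DoubleThresholdReached && !S.AllowReordering);
}

// Illustrative cases only; 8 and 128 are made-up threshold values.
void demoMemCheckThresholds() {
  const MemCheckState TenChecks{10, 8, 128, false};
  // Over the regular threshold: rejected unless the threshold is ignored.
  assert(rejectsForMemChecks(TenChecks, /*IgnoreRTThreshold=*/false));
  assert(!rejectsForMemChecks(TenChecks, /*IgnoreRTThreshold=*/true));

  const MemCheckState TwentyChecks{20, 8, 128, false};
  // Over twice the threshold: rejected even when the threshold is ignored.
  assert(rejectsForMemChecks(TwentyChecks, /*IgnoreRTThreshold=*/true));
}
```

In this sketch, ignoring the threshold only helps up to twice the regular limit; beyond that the loop is still rejected unless reordering is allowed.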

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp


... 425 lines not shown ...	if (auto EstimatedTC = getLoopEstimatedTripCount(L))
return EstimatedTC;		return EstimatedTC;

// Check if upper bound estimate is known.		// Check if upper bound estimate is known.
if (unsigned ExpectedTC = SE.getSmallConstantMaxTripCount(L))		if (unsigned ExpectedTC = SE.getSmallConstantMaxTripCount(L))
return ExpectedTC;		return ExpectedTC;

return None;		return None;
}		}

struct GeneratedRTChecks;		struct GeneratedRTChecks;

namespace llvm {		namespace llvm {
/// InnerLoopVectorizer vectorizes loops which contain only one basic		/// InnerLoopVectorizer vectorizes loops which contain only one basic
/// block to a specified vectorization factor (VF).		/// block to a specified vectorization factor (VF).
/// This class performs the widening of scalars into vectors, or multiple		/// This class performs the widening of scalars into vectors, or multiple
/// scalars. This class also implements the following features:		/// scalars. This class also implements the following features:
/// * It inserts an epilogue loop for handling loops that don't have iteration		/// * It inserts an epilogue loop for handling loops that don't have iteration
/// counts that are known to be a multiple of the vectorization factor.		/// counts that are known to be a multiple of the vectorization factor.
/// * It handles the code generation for reduction variables.		/// * It handles the code generation for reduction variables.
... 1,169 lines not shown ...	public:

/// Invalidates decisions already taken by the cost model.		/// Invalidates decisions already taken by the cost model.
void invalidateCostModelingDecisions() {		void invalidateCostModelingDecisions() {
WideningDecisions.clear();		WideningDecisions.clear();
Uniforms.clear();		Uniforms.clear();
Scalars.clear();		Scalars.clear();
}		}

private:
unsigned NumPredStores = 0;

/// \return An upper bound for the vectorization factor, a power-of-2 larger		/// \return An upper bound for the vectorization factor, a power-of-2 larger
/// than zero. One is returned if vectorization should best be avoided due		/// than zero. One is returned if vectorization should best be avoided due
/// to cost.		/// to cost.
ElementCount computeFeasibleMaxVF(unsigned ConstTripCount,		ElementCount computeFeasibleMaxVF(unsigned ConstTripCount,
ElementCount UserVF);		ElementCount UserVF);

/// The vectorization cost is a combination of the cost itself and a boolean		/// The vectorization cost is a combination of the cost itself and a boolean
/// indicating whether any of the contributing operations will actually		/// indicating whether any of the contributing operations will actually
/// operate on		/// operate on
/// vector values after type legalization in the backend. If this latter value		/// vector values after type legalization in the backend. If this latter value
/// is		/// is
/// false, then all operations will be scalarized (i.e. no vectorization has		/// false, then all operations will be scalarized (i.e. no vectorization has
/// actually taken place).		/// actually taken place).
using VectorizationCostTy = std::pair<unsigned, bool>;		using VectorizationCostTy = std::pair<unsigned, bool>;

		/// Returns the execution time cost of an instruction for a given vector
		/// width. Vector width of one means scalar.
		VectorizationCostTy getInstructionCost(Instruction *I, ElementCount VF);

		float ScalarCost;

		private:
		unsigned NumPredStores = 0;

/// Returns the expected execution cost. The unit of the cost does		/// Returns the expected execution cost. The unit of the cost does
/// not matter because we use the 'cost' units to compare different		/// not matter because we use the 'cost' units to compare different
/// vector widths. The cost that is returned is not normalized by		/// vector widths. The cost that is returned is not normalized by
/// the factor width.		/// the factor width.
VectorizationCostTy expectedCost(ElementCount VF);		VectorizationCostTy expectedCost(ElementCount VF);

/// Returns the execution time cost of an instruction for a given vector
/// width. Vector width of one means scalar.
VectorizationCostTy getInstructionCost(Instruction *I, ElementCount VF);

/// The cost-computation logic from getInstructionCost which provides		/// The cost-computation logic from getInstructionCost which provides
/// the vector type as an output parameter.		/// the vector type as an output parameter.
unsigned getInstructionCost(Instruction *I, ElementCount VF, Type *&VectorTy);		unsigned getInstructionCost(Instruction *I, ElementCount VF, Type *&VectorTy);

/// Calculate vectorization cost of memory instruction \p I.		/// Calculate vectorization cost of memory instruction \p I.
unsigned getMemoryInstructionCost(Instruction *I, ElementCount VF);		unsigned getMemoryInstructionCost(Instruction *I, ElementCount VF);

/// The cost computation for scalarized memory instruction.		/// The cost computation for scalarized memory instruction.
... 243 lines not shown ...	void Create(Loop *L, const LoopAccessInfo &LAI,
TmpBlock->replaceAllUsesWith(Preheader);		TmpBlock->replaceAllUsesWith(Preheader);
TmpBlock->getTerminator()->moveBefore(Preheader->getTerminator());		TmpBlock->getTerminator()->moveBefore(Preheader->getTerminator());
Preheader->getTerminator()->eraseFromParent();		Preheader->getTerminator()->eraseFromParent();
DT->changeImmediateDominator(LoopHeader, Preheader);		DT->changeImmediateDominator(LoopHeader, Preheader);
DT->eraseNode(TmpBlock);		DT->eraseNode(TmpBlock);
LI->removeBlock(TmpBlock);		LI->removeBlock(TmpBlock);
}		}

		unsigned getCost(LoopVectorizationCostModel &CM) {
		unsigned RTCheckCost = 0;
		for (Instruction &I : *TmpBlock)
		RTCheckCost += CM.getInstructionCost(&I, ElementCount::getFixed(1)).first;
		return RTCheckCost;
		}

~GeneratedRTChecks() {		~GeneratedRTChecks() {
if (!TmpBlock) {		if (!TmpBlock) {
Cleaner.markResultUsed();		Cleaner.markResultUsed();
return;		return;
}		}

if (!SCEVCheck && TmpBlock->empty()) {		if (!SCEVCheck && TmpBlock->empty()) {
Cleaner.markResultUsed();		Cleaner.markResultUsed();
... 3,902 lines not shown ...
VectorizationFactor		VectorizationFactor
LoopVectorizationCostModel::selectVectorizationFactor(ElementCount MaxVF) {		LoopVectorizationCostModel::selectVectorizationFactor(ElementCount MaxVF) {
// FIXME: This can be fixed for scalable vectors later, because at this stage		// FIXME: This can be fixed for scalable vectors later, because at this stage
// the LoopVectorizer will only consider vectorizing a loop with scalable		// the LoopVectorizer will only consider vectorizing a loop with scalable
// vectors when the loop has a hint to enable vectorization for a given VF.		// vectors when the loop has a hint to enable vectorization for a given VF.
assert(!MaxVF.isScalable() && "scalable vectors not yet supported");		assert(!MaxVF.isScalable() && "scalable vectors not yet supported");

float Cost = expectedCost(ElementCount::getFixed(1)).first;		float Cost = expectedCost(ElementCount::getFixed(1)).first;
const float ScalarCost = Cost;		ScalarCost = Cost;
unsigned Width = 1;		unsigned Width = 1;
LLVM_DEBUG(dbgs() << "LV: Scalar loop costs: " << (int)ScalarCost << ".\n");		LLVM_DEBUG(dbgs() << "LV: Scalar loop costs: " << (int)ScalarCost << ".\n");

bool ForceVectorization = Hints->getForce() == LoopVectorizeHints::FK_Enabled;		bool ForceVectorization = Hints->getForce() == LoopVectorizeHints::FK_Enabled;
if (ForceVectorization && MaxVF.isVector()) {		if (ForceVectorization && MaxVF.isVector()) {
// Ignore scalar width, because the user explicitly wants vectorization.		// Ignore scalar width, because the user explicitly wants vectorization.
// Initialize cost to max so that VF = 2 is, at least, chosen during cost		// Initialize cost to max so that VF = 2 is, at least, chosen during cost
// evaluation.		// evaluation.
... 3,455 lines not shown ...	#endif /* NDEBUG */
}		}

bool VectorizeLoop = true, InterleaveLoop = true;		bool VectorizeLoop = true, InterleaveLoop = true;
// Identify the diagnostic messages that should be produced.		// Identify the diagnostic messages that should be produced.
std::pair<StringRef, std::string> VecDiagMsg, IntDiagMsg;		std::pair<StringRef, std::string> VecDiagMsg, IntDiagMsg;
if (VF.Width.isScalar()) {		if (VF.Width.isScalar()) {
LLVM_DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n");		LLVM_DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n");
VecDiagMsg = std::make_pair(		VecDiagMsg = std::make_pair(
"VectorizationNotBeneficial",		"VectorizationNotBeneficial",
		lebedev.ri (inline comment, Not Done): Please make 0.005 an option
"the cost-model indicates that vectorization is not beneficial");		"the cost-model indicates that vectorization is not beneficial");
VectorizeLoop = false;		VectorizeLoop = false;
}		}

if (!MaybeVF && UserIC > 1) {		if (!MaybeVF && UserIC > 1) {
// Tell the user interleaving was avoided up-front, despite being explicitly		// Tell the user interleaving was avoided up-front, despite being explicitly
// requested.		// requested.
LLVM_DEBUG(dbgs() << "LV: Ignoring UserIC, because vectorization and "		LLVM_DEBUG(dbgs() << "LV: Ignoring UserIC, because vectorization and "
... 59 lines not shown ...	ORE->emit([&]() {
<< IntDiagMsg.second;		<< IntDiagMsg.second;
});		});
} else if (VectorizeLoop && InterleaveLoop) {		} else if (VectorizeLoop && InterleaveLoop) {
LLVM_DEBUG(dbgs() << "LV: Found a vectorizable loop (" << VF.Width		LLVM_DEBUG(dbgs() << "LV: Found a vectorizable loop (" << VF.Width
<< ") in " << DebugLocStr << '\n');		<< ") in " << DebugLocStr << '\n');
LLVM_DEBUG(dbgs() << "LV: Interleave Count is " << IC << '\n');		LLVM_DEBUG(dbgs() << "LV: Interleave Count is " << IC << '\n');
}		}

if (Requirements.doesNotMeet(F, L, Hints)) {
LLVM_DEBUG(dbgs() << "LV: Not vectorizing: loop did not meet vectorization "
"requirements.\n");
Hints.emitRemarkWithHints();
return false;
}

bool DisableRuntimeUnroll = false;		bool DisableRuntimeUnroll = false;
MDNode *OrigLoopID = L->getLoopID();		MDNode *OrigLoopID = L->getLoopID();
{		{
// Optimistically generate runtime checks. Drop them if they turn out to not		// Optimistically generate runtime checks. Drop them if they turn out to not
// be profitable. Limit the scope of Checks, so the cleanup happens		// be profitable. Limit the scope of Checks, so the cleanup happens
// immediately after vector codegeneration is done.		// immediately after vector codegeneration is done.
GeneratedRTChecks Checks(L->getLoopPreheader(), *PSE.getSE(), DT);		GeneratedRTChecks Checks(L->getLoopPreheader(), *PSE.getSE(), DT);
if (!VF.Width.isScalar() || IC > 1)		bool CanIgnoreRTThreshold = true;
		if (!VF.Width.isScalar() || IC > 1) {
		CanIgnoreRTThreshold = false;

Checks.Create(L, *LVL.getLAI(), PSE.getUnionPredicate(), LI);		Checks.Create(L, *LVL.getLAI(), PSE.getUnionPredicate(), LI);
		if (ExpectedTC) {
		unsigned RTCost = Checks.getCost(CM);
		// If the expected cost of the runtime checks is a small fraction of the
		// expected cost of the scalar loop, we can be more aggressive with
		// using runtime checks.
		CanIgnoreRTThreshold = RTCost < (*ExpectedTC * CM.ScalarCost * 0.005);
		LLVM_DEBUG(dbgs() << "LV: Cost of runtime check: " << RTCost << " "
		<< *ExpectedTC * CM.ScalarCost << "\n");
		}
		}

		if (Requirements.doesNotMeet(F, L, Hints, CanIgnoreRTThreshold)) {
		LLVM_DEBUG(
		dbgs() << "LV: Not vectorizing: loop did not meet vectorization "
		"requirements.\n");
		Hints.emitRemarkWithHints();
		return false;
		}

LVP.setBestPlan(VF.Width, IC);		LVP.setBestPlan(VF.Width, IC);

using namespace ore;		using namespace ore;
if (!VectorizeLoop) {		if (!VectorizeLoop) {
assert(IC > 1 && "interleave count should not be 1 or 0");		assert(IC > 1 && "interleave count should not be 1 or 0");
// If we decided that it is not legal to vectorize the loop, then		// If we decided that it is not legal to vectorize the loop, then
// interleave it.		// interleave it.
InnerLoopUnroller Unroller(L, PSE, LI, DT, TLI, TTI, AC, ORE, IC, &LVL,		InnerLoopUnroller Unroller(L, PSE, LI, DT, TLI, TTI, AC, ORE, IC, &LVL,
... 200 lines not shown ...
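
The cost-based check added above compares the cost of the runtime checks against a small fraction of the total scalar loop cost (expected trip count times per-iteration scalar cost). The following is a minimal sketch under stated assumptions: `canIgnoreRTThreshold` as a free function, its `Fraction` parameter, and `demoCostCheck` are hypothetical, and the costs used (30 per scalar iteration, 40 for the checks) are made-up numbers, not values measured from the test case.

```cpp
#include <cassert>

// Hypothetical helper; the patch computes this inline with a hard-coded
// 0.005. Exposing Fraction as a parameter is an assumption made here.
bool canIgnoreRTThreshold(unsigned RTCost, unsigned ExpectedTC,
                          float ScalarCost, double Fraction = 0.005) {
  // The checks are considered cheap if they cost less than a small
  // fraction of running the whole scalar loop.
  return RTCost < ExpectedTC * ScalarCost * Fraction;
}

// Made-up costs: scalar iteration cost 30, runtime check cost 40.
void demoCostCheck() {
  // Short loop: budget is 51 * 30 * 0.005 = 7.65, so the checks are too costly.
  assert(!canIgnoreRTThreshold(40, 51, 30.0f));
  // Long loop: budget is 501 * 30 * 0.005 = 75.15, so the checks fit.
  assert(canIgnoreRTThreshold(40, 501, 30.0f));
}
```

With these illustrative numbers, the same set of checks is rejected for a short loop and accepted for a long one, which is the behaviour the two test functions below are meant to exercise.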

llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-size-based-threshold.ll

This file was added.

				; RUN: opt -loop-vectorize -mtriple=arm64-apple-iphoneos -S %s | FileCheck %s

				%struct.snork = type <{ i32, i32, i16, [6 x i8], %struct.snork.0, i32, [4 x i8] }>
				%struct.snork.0 = type { [4 x %struct.zot] }
				%struct.zot = type { %struct.baz }
				%struct.baz = type { %struct.pluto }
				%struct.pluto = type { %struct.quux }
				%struct.quux = type { %struct.widget }
				%struct.widget = type { %struct.baz.1* }
				xbolva00 (inline comment, Not Done): is too small?
				%struct.baz.1 = type { i32 (...)**, %struct.zot.2 }
				%struct.zot.2 = type { %struct.pluto.3 }
				%struct.pluto.3 = type { %struct.bar }
				%struct.bar = type { %struct.barney, %struct.blam.4 }
				%struct.barney = type { %struct.blam }
				%struct.blam = type { i8 }
				%struct.blam.4 = type { i16, i16, i16* }
				%struct.foo = type { i32, i16*, i32, i32 }
				%struct.blam.5 = type { i32, i16*, i32, i32 }

				; The trip count in the loop in this function is too small to warrant large runtime checks.
				; CHECK-LABEL: define {{.*}} @test_tc_too_small
				; CHECK-NOT: vector.memcheck
				; CHECK-NOT: vector.body
				define void @test_tc_too_small(%struct.snork* nocapture readonly %arg, %struct.foo* nocapture readonly byval(%struct.foo) align 8 %arg1, %struct.blam.5* nocapture readonly byval(%struct.blam.5) align 8 %arg2, %struct.blam.5* nocapture readonly byval(%struct.blam.5) align 8 %arg3) {
				entry:
				%tmp11 = getelementptr inbounds %struct.blam.5, %struct.blam.5* %arg3, i64 0, i32 0
				%tmp12 = load i32, i32* %tmp11, align 8
				%tmp13 = getelementptr inbounds %struct.blam.5, %struct.blam.5* %arg3, i64 0, i32 1
				%tmp14 = load i16, i16* %tmp13, align 8
				%tmp17 = getelementptr inbounds %struct.blam.5, %struct.blam.5* %arg2, i64 0, i32 1
				%tmp18 = load i16, i16* %tmp17, align 8
				%tmp19 = getelementptr inbounds %struct.foo, %struct.foo* %arg1, i64 0, i32 0
				%tmp20 = load i32, i32* %tmp19, align 8
				%tmp21 = getelementptr inbounds %struct.foo, %struct.foo* %arg1, i64 0, i32 1
				%tmp22 = load i16, i16* %tmp21, align 8
				%tmp23 = getelementptr inbounds %struct.snork, %struct.snork* %arg, i64 0, i32 1
				%tmp24 = load i32, i32* %tmp23, align 4
				%tmp26 = icmp sgt i32 %tmp24, 0
				%tmp39 = sext i32 %tmp12 to i64
				%tmp40 = shl nsw i64 %tmp39, 1
				%tmp41 = sext i32 %tmp20 to i64
				%tmp42 = getelementptr inbounds i16, i16* %tmp22, i64 %tmp41
				br label %bb54

				bb54: ; preds = %bb54, %bb37
				%tmp55 = phi i64 [ 0, %entry ], [ %tmp88, %bb54 ]
				%tmp56 = getelementptr inbounds i16, i16* %tmp18, i64 %tmp55
				%tmp57 = load i16, i16* %tmp56, align 2
				%tmp58 = sext i16 %tmp57 to i32
				%tmp59 = getelementptr inbounds i16, i16* %tmp14, i64 %tmp55
				%tmp60 = load i16, i16* %tmp59, align 2
				%tmp61 = sext i16 %tmp60 to i32
				%tmp62 = mul nsw i32 %tmp61, 11
				%tmp63 = getelementptr inbounds i16, i16* %tmp59, i64 %tmp39
				%tmp64 = load i16, i16* %tmp63, align 2
				%tmp65 = sext i16 %tmp64 to i32
				%tmp66 = mul nsw i32 %tmp65, -4
				%tmp67 = getelementptr inbounds i16, i16* %tmp59, i64 %tmp40
				%tmp68 = load i16, i16* %tmp67, align 2
				%tmp69 = sext i16 %tmp68 to i32
				%tmp70 = add nsw i32 %tmp62, 4
				%tmp71 = add nsw i32 %tmp70, %tmp66
				%tmp72 = add nsw i32 %tmp71, %tmp69
				%tmp73 = lshr i32 %tmp72, 3
				%tmp74 = add nsw i32 %tmp73, %tmp58
				%tmp75 = lshr i32 %tmp74, 1
				%tmp76 = mul nsw i32 %tmp61, 5
				%tmp77 = shl nsw i32 %tmp65, 2
				%tmp78 = add nsw i32 %tmp76, 4
				%tmp79 = add nsw i32 %tmp78, %tmp77
				%tmp80 = sub nsw i32 %tmp79, %tmp69
				%tmp81 = lshr i32 %tmp80, 3
				%tmp82 = sub nsw i32 %tmp81, %tmp58
				%tmp83 = lshr i32 %tmp82, 1
				%tmp84 = trunc i32 %tmp75 to i16
				%tmp85 = getelementptr inbounds i16, i16* %tmp22, i64 %tmp55
				store i16 %tmp84, i16* %tmp85, align 2
				%tmp86 = trunc i32 %tmp83 to i16
				%tmp87 = getelementptr inbounds i16, i16* %tmp42, i64 %tmp55
				store i16 %tmp86, i16* %tmp87, align 2
				%tmp88 = add nuw nsw i64 %tmp55, 1
				%tmp89 = icmp ult i64 %tmp55, 50
				br i1 %tmp89, label %bb54, label %bb90

				bb90: ; preds = %bb54, %bb27, %bb
				ret void
				}

				; The trip count in the loop in this function is high enough to warrant large runtime checks.
				; CHECK-LABEL: define {{.*}} @test_tc_big_enough
				; CHECK: vector.memcheck
				; CHECK: vector.body
				define void @test_tc_big_enough(%struct.snork* nocapture readonly %arg, %struct.foo* nocapture readonly byval(%struct.foo) align 8 %arg1, %struct.blam.5* nocapture readonly byval(%struct.blam.5) align 8 %arg2, %struct.blam.5* nocapture readonly byval(%struct.blam.5) align 8 %arg3) {
				entry:
				%tmp11 = getelementptr inbounds %struct.blam.5, %struct.blam.5* %arg3, i64 0, i32 0
				%tmp12 = load i32, i32* %tmp11, align 8
				%tmp13 = getelementptr inbounds %struct.blam.5, %struct.blam.5* %arg3, i64 0, i32 1
				%tmp14 = load i16, i16* %tmp13, align 8
				%tmp17 = getelementptr inbounds %struct.blam.5, %struct.blam.5* %arg2, i64 0, i32 1
				%tmp18 = load i16, i16* %tmp17, align 8
				%tmp19 = getelementptr inbounds %struct.foo, %struct.foo* %arg1, i64 0, i32 0
				%tmp20 = load i32, i32* %tmp19, align 8
				%tmp21 = getelementptr inbounds %struct.foo, %struct.foo* %arg1, i64 0, i32 1
				%tmp22 = load i16, i16* %tmp21, align 8
				%tmp23 = getelementptr inbounds %struct.snork, %struct.snork* %arg, i64 0, i32 1
				%tmp24 = load i32, i32* %tmp23, align 4
				%tmp26 = icmp sgt i32 %tmp24, 0
				%tmp39 = sext i32 %tmp12 to i64
				%tmp40 = shl nsw i64 %tmp39, 1
				%tmp41 = sext i32 %tmp20 to i64
				%tmp42 = getelementptr inbounds i16, i16* %tmp22, i64 %tmp41
				br label %bb54

				bb54: ; preds = %bb54, %bb37
				%tmp55 = phi i64 [ 0, %entry ], [ %tmp88, %bb54 ]
				%tmp56 = getelementptr inbounds i16, i16* %tmp18, i64 %tmp55
				%tmp57 = load i16, i16* %tmp56, align 2
				%tmp58 = sext i16 %tmp57 to i32
				%tmp59 = getelementptr inbounds i16, i16* %tmp14, i64 %tmp55
				%tmp60 = load i16, i16* %tmp59, align 2
				%tmp61 = sext i16 %tmp60 to i32
				%tmp62 = mul nsw i32 %tmp61, 11
				%tmp63 = getelementptr inbounds i16, i16* %tmp59, i64 %tmp39
				%tmp64 = load i16, i16* %tmp63, align 2
				%tmp65 = sext i16 %tmp64 to i32
				%tmp66 = mul nsw i32 %tmp65, -4
				%tmp67 = getelementptr inbounds i16, i16* %tmp59, i64 %tmp40
				%tmp68 = load i16, i16* %tmp67, align 2
				%tmp69 = sext i16 %tmp68 to i32
				%tmp70 = add nsw i32 %tmp62, 4
				%tmp71 = add nsw i32 %tmp70, %tmp66
				%tmp72 = add nsw i32 %tmp71, %tmp69
				%tmp73 = lshr i32 %tmp72, 3
				%tmp74 = add nsw i32 %tmp73, %tmp58
				%tmp75 = lshr i32 %tmp74, 1
				%tmp76 = mul nsw i32 %tmp61, 5
				%tmp77 = shl nsw i32 %tmp65, 2
				%tmp78 = add nsw i32 %tmp76, 4
				%tmp79 = add nsw i32 %tmp78, %tmp77
				%tmp80 = sub nsw i32 %tmp79, %tmp69
				%tmp81 = lshr i32 %tmp80, 3
				%tmp82 = sub nsw i32 %tmp81, %tmp58
				%tmp83 = lshr i32 %tmp82, 1
				%tmp84 = trunc i32 %tmp75 to i16
				%tmp85 = getelementptr inbounds i16, i16* %tmp22, i64 %tmp55
				store i16 %tmp84, i16* %tmp85, align 2
				%tmp86 = trunc i32 %tmp83 to i16
				%tmp87 = getelementptr inbounds i16, i16* %tmp42, i64 %tmp55
				store i16 %tmp86, i16* %tmp87, align 2
				%tmp88 = add nuw nsw i64 %tmp55, 1
				%tmp89 = icmp ult i64 %tmp55, 500
				br i1 %tmp89, label %bb54, label %bb90

				bb90: ; preds = %bb54, %bb27, %bb
				ret void
				}


Diff 315353
