This is an archive of the discontinued LLVM Phabricator instance.

Differential D75981

[LV] Create RT checks once VF/IC are selected, track scalar cost.
ClosedPublic

Authored by fhahn on Mar 11 2020, 4:05 AM.

Details

Reviewers

rengolin
Ayal
gilr
hsaito
anemet
lebedev.ri
dmgreen

Commits

rGcb69ba4faaf1: [LV] Create RT checks once VF/IC are selected, track scalar cost.

Summary

This patch updates LV to generate runtime checks after the VF & IC are
selected. It allows deciding whether to vectorize with runtime checks or not
based on their cost compared to the vector loop.

It also updates VectorizationFactor to include the scalar cost.
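
As a rough illustration of the second point, here is a minimal, self-contained sketch. It is not the LLVM code: `Factor`, `Candidate`, and `selectFactor` are hypothetical stand-ins, plain integers replace `ElementCount` and `InstructionCost`, and the per-lane cost comparison is an assumption about how candidates might be ranked. It only shows how carrying the scalar cost next to the chosen width lets a later step compare the two without recomputing anything.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for VectorizationFactor; integers replace
// ElementCount and InstructionCost to keep the sketch self-contained.
struct Factor {
  unsigned Width;      // Chosen vectorization factor (1 means scalar).
  uint64_t Cost;       // Cost of one loop iteration at that width.
  uint64_t ScalarCost; // Cost of one scalar loop iteration.
};

// Hypothetical (width, cost) pair, as some cost model might produce it.
struct Candidate {
  unsigned Width;
  uint64_t Cost;
};

// Pick the candidate with the lowest cost per lane (Cost / Width), starting
// from the scalar loop. Every returned Factor carries the scalar cost.
Factor selectFactor(uint64_t ScalarCost, const std::vector<Candidate> &Cands) {
  assert(ScalarCost > 0 && "sketch assumes a computed, nonzero scalar cost");
  Factor Best{1, ScalarCost, ScalarCost};
  for (const Candidate &C : Cands) {
    if (C.Width <= 1)
      continue; // The scalar case is already the starting point.
    // C.Cost / C.Width < Best.Cost / Best.Width, written without division.
    if (C.Cost * Best.Width < Best.Cost * C.Width)
      Best = Factor{C.Width, C.Cost, ScalarCost};
  }
  return Best;
}
```

A caller that later estimates the cost of the runtime checks can then read both `Cost` and `ScalarCost` from the returned value.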

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fhahn created this revision.Mar 11 2020, 4:05 AM

Herald added a project: Restricted Project. Mar 11 2020, 4:05 AM

Herald added subscribers: rkruppe, hiraditya.

fhahn mentioned this in D71053: [LV] Take overhead of run-time checks into account during vectorization..Mar 11 2020, 4:12 AM

fhahn added a parent revision: D75980: [LV] Generate RT checks up-front and remove them if required..Mar 11 2020, 4:35 AM

fhahn edited the summary of this revision. Mar 11 2020, 4:42 AM

Harbormaster failed remote builds in B48794: Diff 249581!Mar 11 2020, 5:44 AM

lebedev.ri added inline comments.Mar 11 2020, 3:19 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10447	Please make 0.005 an option

Reverse-ping, thanks.
Anything to get this going?

lebedev.ri added a reviewer: lebedev.ri.Jul 21 2020, 12:03 PM

rebase.

With the linked dependent patches, this should now successfully build test-suite with MultiSource/SPEC2000/SPEC2006.

This leads to additional vectorization with runtime checks in a few more cases:

Same hash: 223 (filtered out)
Remaining: 14
Metric: loop-vectorize.LoopsVectorized

Program                                       patch1  patch2   diff
 test-suite...Source/Benchmarks/sim/sim.test    5.00    8.00  60.0%
 test-suite...rks/FreeBench/pifft/pifft.test   33.00   47.00  42.4%
 test-suite...chmarks/Rodinia/srad/srad.test    3.00    4.00  33.3%
 test-suite...CFP2000/177.mesa/177.mesa.test  379.00  417.00  10.0%
 test-suite...CI_Purple/SMG2000/smg2000.test   78.00   84.00   7.7%
 test-suite...pps-C/SimpleMOC/SimpleMOC.test   39.00   42.00   7.7%
 test-suite...oxyApps-C/miniGMG/miniGMG.test   42.00   44.00   4.8%
 test-suite.../CINT2000/176.gcc/176.gcc.test   97.00  100.00   3.1%
 test-suite...006/450.soplex/450.soplex.test   88.00   90.00   2.3%
 test-suite...lications/ClamAV/clamscan.test   91.00   93.00   2.2%
 test-suite...pplications/oggenc/oggenc.test  130.00  132.00   1.5%
 test-suite...006/447.dealII/447.dealII.test  958.00  970.00   1.3%

Harbormaster completed remote builds in B66944: Diff 282932.Aug 4 2020, 8:44 AM

xbolva00 added a subscriber: xbolva00.Aug 15 2020, 3:59 PM

Rebased on top of current trunk. This version now can build MultiSource/SPEC2006/SPEC2000 with -O3 -flto without crashing.

The current version leads to a few more vectorized loops in some benchmarks:

Tests: 236
Same hash: 200 (filtered out)
Remaining: 36
Metric: loop-vectorize.LoopsVectorized

Program                                        base   patch.lv-mem-cost diff
 test-suite...Source/Benchmarks/sim/sim.test     5.00   8.00            60.0%
 test-suite...chmarks/Rodinia/srad/srad.test     3.00   4.00            33.3%
 test-suite...rks/FreeBench/pifft/pifft.test    33.00  43.00            30.3%
 test-suite...CFP2000/177.mesa/177.mesa.test   386.00 424.00             9.8%
 test-suite...CI_Purple/SMG2000/smg2000.test    77.00  83.00             7.8%
 test-suite...pps-C/SimpleMOC/SimpleMOC.test    39.00  42.00             7.7%
 test-suite...oxyApps-C/miniGMG/miniGMG.test    44.00  46.00             4.5%
 test-suite.../CINT2000/176.gcc/176.gcc.test    99.00 102.00             3.0%
 test-suite...006/450.soplex/450.soplex.test    88.00  90.00             2.3%
 test-suite...lications/ClamAV/clamscan.test    97.00  99.00             2.1%
 test-suite...pplications/oggenc/oggenc.test   151.00 153.00             1.3%
 test-suite...006/447.dealII/447.dealII.test   970.00 982.00             1.2%

Harbormaster completed remote builds in B83877: Diff 314337.Jan 4 2021, 2:13 AM

rkruppe removed a subscriber: rkruppe.Jan 4 2021, 2:53 AM

rebase on top of the recent changes.

Harbormaster completed remote builds in B84454: Diff 315353.Jan 8 2021, 5:01 AM

rebase

Harbormaster completed remote builds in B89374: Diff 323978.Feb 16 2021, 6:01 AM

Rebased after recent changes to D75980

Harbormaster completed remote builds in B89928: Diff 324996.Feb 19 2021, 8:40 AM

fhahn mentioned this in rG0cb9d8acbccb: [LV] Add test cases that require a larger number of RT checks..Mar 2 2021, 2:50 AM

Ping.

All dependent patches have been submitted now. I also pre-committed a simplified version of the test cases in 0cb9d8acbccb.

The update also adds an option to control the threshold.

Harbormaster completed remote builds in B91538: Diff 327404.Mar 2 2021, 3:33 AM

xbolva00 added inline comments.Mar 2 2021, 3:42 AM

llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-size-based-threshold.ll
9 ↗	(On Diff #327404)	is too small?

Fix wording in test comments, thanks!

Harbormaster completed remote builds in B91541: Diff 327407.Mar 2 2021, 3:49 AM

You'd better remove the (WIP) suffix if this patch is ready for review. Otherwise people may think it is still in progress...

lebedev.ri added inline comments.Mar 4 2021, 3:51 AM

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
269–270 ↗	(On Diff #327407)	Hm, why does `PragmaThresholdReached` not check for `Hints.allowReordering()`? If i really asked for loop to be vectorized, why are there further limits on the sanity checks? Or, for that matter, doesn't forcing vectorization disable those checks in the first place?

I'd like to understand why you chose to go the way it is instead of taking cost of runtime checks into account in the cost model itself? To me, the cost model is the right place for that. I would expect to see something similar to what I did in https://reviews.llvm.org/D71053 for LoopVectorizationPlanner::mayDisregardRTChecksOverhead

In D75981#2605728, @ebrevnov wrote:

I'd like to understand why you chose to go the way it is instead of taking cost of runtime checks into account in the cost model itself? To me, the cost model is the right place for that. I would expect to see something similar to what I did in https://reviews.llvm.org/D71053 for LoopVectorizationPlanner::mayDisregardRTChecksOverhead

That's a good point. I initially tried to keep things as closely modeled to the original code. I think the way the number of runtime checks is handled in LoopVectorizationRequirements is not ideal and makes things more difficult to follow. I am also not sure why those checks are handled separately (in doesNotMeet). I tried to see if we can remove doesNotMeet and instead move the checks at an earlier and more appropriate place.

I put up D98634 and D98633 to remove doesNotMeet, which moved the decision whether to vectorize with RT checks to LVP::plan(). I also updated this patch to make the cost-based decision in LVP::plan. From there it should be easy to adjust it further to use it for more cost-based decisions, as in your patch. Is this more in line with what you had in mind?

Herald added a subscriber: bmahjour. Mar 15 2021, 8:48 AM

Harbormaster completed remote builds in B93828: Diff 330679.Mar 15 2021, 8:49 AM

fhahn added inline comments.Mar 15 2021, 8:54 AM

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
269–270 ↗	(On Diff #327407)	I can't really comment much on why the existing code does what it does. But it is indeed surprising that runtime checks can block vectorization if a width is explicitly set. I put up D98634 and D98633 to move the `doesNotMeet` restrictions to what seems more appropriate places to me. The updated code also skips the checks if an explicit VF is forced by the user. I updated the code in this patch to always use the cost-based check if a constant TC is available here.

fhahn retitled this revision from [LV] Allow large RT checks, if they are a fraction of the scalar cost (WIP) to [LV] Allow large RT checks, if they are a fraction of the scalar cost..Mar 29 2021, 9:31 AM

Rebase & ping.

All dependent patches have been submitted and the decision is now made during planning.

Harbormaster completed remote builds in B96141: Diff 333908.Mar 29 2021, 9:32 AM

LGTM.
As with other patches, i'm not the best reviewer for this, but clearly others are otherwise preoccupied.

This isn't the final form, but i think this part is a reasonably safe step.
Things can and will be adjusted further later.

This revision is now accepted and ready to land.Mar 29 2021, 9:39 AM

I don't really understand why we need this separate heuristic for runtime checks. Why don't we simply add cost of runtime checks (possibly with some small scaling to be safe) to total cost of vector loop and just use existing cost model to decide?

Please add regression tests from https://reviews.llvm.org/D71053 to this change set as well.

Reverse ping, thanks.

@fhahn Reverse ping, thanks :)

@fhahn Reverse ping, thanks :)

@fhahn Reverse ping, thanks :)

reverse-ping, thanks.

Futile weekly reverse-ping, thanks (:

Reverse ping, thanks.

I finally had some time to rebase this change and fix the fallout.

In D75981#2662881, @ebrevnov wrote:

I don't really understand why we need this separate heuristic for runtime checks. Why don't we simply add cost of runtime checks (possibly with some small scaling to be safe) to total cost of vector loop and just use existing cost model to decide?

That's a good point, but I think that would be better as separate change, because that's a more aggressive change than replacing existing limit. IIUC that's more in line with your D71053.

Please add regression tests from https://reviews.llvm.org/D71053 to this change set as well.

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

Harbormaster completed remote builds in B111038: Diff 354555.Jun 25 2021, 11:00 AM

Bump :)

In D75981#2841271, @fhahn wrote:

I finally had some time to rebase this change and fix the fallout.

Hurray!
It would be good to finally have this resolved.

reverse ping, thanks

Sorry for long silence. Got into hospital with COVID-19 for almost a month.

In D75981#2662881, @ebrevnov wrote:

I don't really understand why we need this separate heuristic for runtime checks. Why don't we simply add cost of runtime checks (possibly with some small scaling to be safe) to total cost of vector loop and just use existing cost model to decide?

That's a good point, but I think that would be better as separate change, because that's a more aggressive change than replacing existing limit. IIUC that's more in line with your D71053.

You are right, I'm essentially asking to follow D71053. First of all, for the sake of making progress, I'm not going to block this change if you promise to continue working on a cost-model-driven approach.
But I personally think that it would save a lot of time if we went with a cost-model-based approach in the first place, because the most time-consuming part would be fixing performance regressions, not the implementation itself. I will leave it to you to decide :-).

Please add regression tests from https://reviews.llvm.org/D71053 to this change set as well.

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

I believe both test cases have vectorization with runtime checks. Look for "; CHECK: vector.memcheck:"

It is *really* sad to see these two patches to be stuck :(

I think the goals of these two patches are largely correlated.

I think the main problem is that it isn't quite obvious why hard cut-offs on the runtime check complexity exist.
I guess, to not generate some very large and ridiculous checks.
But clearly, the current cut-offs are just bogusly low.
But i also guess simply bumping them won't really solve the problem,
so i guess we need to redefine them. But what is the right metric,
especially if the trip count is not constant?
Cost of a single scalar loop iteration?

Reverse ping, thanks.

I think the main problem is that it isn't quite obvious why hard cut-offs on the runtime check complexity exist.
I guess, to not generate some very large and ridiculous checks.

I believe those cut-offs exist for a single reason: there was no way to calculate the "real" cost of SCEV-generated instructions. Now there is support for that, and we can/should simply take the cost of runtime checks into account in the cost model.

But clearly, the current cut-offs are just bogusly low.
But i also guess simply bumping them won't really solve the problem,
so i guess we need to redefine them. But what is the right metric,
especially if the trip count is not constant?
Cost of a single scalar loop iteration?

Maybe not the best approach, but taking some reasonable average across applications works well enough. Another metric to consider is the benefit from vectorization: loops with an expected 3x improvement should be more likely to be vectorized than those with 1.1x.
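
One way to make the "benefit from vectorization" metric concrete is a break-even trip count: the number of iterations after which the savings of the vector loop pay for the runtime checks. The sketch below is only an illustration under simplifying assumptions (integer costs, no scalar remainder loop, checks always pass); `breakEvenTripCount` and its parameters are hypothetical names and are not taken from any patch discussed here.

```cpp
#include <cstdint>
#include <optional>

// Smallest trip count TC at which
//   RTCheckCost + (TC / VF) * VectorIterCost <= TC * ScalarIterCost,
// ignoring remainder iterations. Returns std::nullopt if one vector
// iteration is not cheaper than VF scalar iterations.
std::optional<uint64_t> breakEvenTripCount(uint64_t RTCheckCost,
                                           uint64_t ScalarIterCost,
                                           uint64_t VectorIterCost,
                                           unsigned VF) {
  if (VF == 0)
    return std::nullopt;
  uint64_t ScalarPerVecIter = ScalarIterCost * VF;
  if (VectorIterCost >= ScalarPerVecIter)
    return std::nullopt; // No per-iteration benefit to amortize the checks.
  // Savings gained by each vector iteration over VF scalar iterations.
  uint64_t Saving = ScalarPerVecIter - VectorIterCost;
  // TC >= RTCheckCost * VF / Saving, rounded up.
  uint64_t Num = RTCheckCost * VF;
  return (Num + Saving - 1) / Saving;
}
```

With a scalar iteration cost of 12, VF = 4, and checks costing 64, a vector iteration cost of 16 (a 3x improvement) breaks even at 8 iterations, while a cost of 44 (roughly 1.1x) needs 64 iterations.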

Honestly in the end i'm not sure i know which approach is best, i just want to see this finally fixed :/

Rebased again after recent changes.

That's a good point, but I think that would be better as separate change, because that's a more aggressive change than replacing existing limit. IIUC that's more in line with your D71053.

You are right, I'm essentially asking to follow D71053. First of all, for the sake of making progress, I'm not going to block this change if you promise to continue working on a cost-model-driven approach.
But I personally think that it would save a lot of time if we went with a cost-model-based approach in the first place, because the most time-consuming part would be fixing performance regressions, not the implementation itself. I will leave it to you to decide :-).

The way I see it the current patch is already using a cost-model based approach: it already computes the cost of the runtime checks and the cost of the scalar loop and compares them.

The formula used in the patch initially is conservative, I think, in that it allows larger runtime checks only if their failing adds just a small overhead to the total cost of the scalar loop.

Of course we can choose other formulas, e.g. computing the cost of all vector iterations + RT checks and compare it against the cost of all scalar iterations. This is more optimistic as it assumes the runtime checks succeed.
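
The two formulas can be contrasted in a small sketch. This is an illustration only, not the code from the patch: the function names, the integer cost type, and the `MaxFraction` parameter are assumptions. An inline comment earlier in this review mentions the constant 0.005, but its exact role in the patch is not shown on this page, so it is used here purely as an example value.

```cpp
#include <cassert>
#include <cstdint>

// Conservative: accept the checks only if, should they fail, their cost is
// at most MaxFraction of the total cost of running the scalar loop.
bool checksCheapVsScalarLoop(uint64_t RTCheckCost, uint64_t ScalarIterCost,
                             uint64_t TripCount, double MaxFraction) {
  assert(MaxFraction >= 0.0 && "fraction must be non-negative");
  double TotalScalar = double(ScalarIterCost) * double(TripCount);
  return double(RTCheckCost) <= MaxFraction * TotalScalar;
}

// Optimistic: assume the checks pass and compare the checks plus all vector
// iterations (and scalar remainder iterations) against the scalar loop.
bool vectorPlusChecksBeatsScalar(uint64_t RTCheckCost, uint64_t ScalarIterCost,
                                 uint64_t VectorIterCost, unsigned VF,
                                 uint64_t TripCount) {
  assert(VF > 0 && "vectorization factor must be positive");
  uint64_t VectorIters = TripCount / VF;
  uint64_t RemainderIters = TripCount % VF; // Executed by a scalar epilogue.
  uint64_t VectorTotal = RTCheckCost + VectorIters * VectorIterCost +
                         RemainderIters * ScalarIterCost;
  return VectorTotal < ScalarIterCost * TripCount;
}
```

For checks costing 40, a scalar iteration cost of 10, and a trip count of 100, the conservative test with a fraction of 0.005 rejects the checks (40 > 5), whereas the optimistic test with VF = 4 and a vector iteration cost of 20 accepts them (540 < 1000).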

The main reason I went for the conservative approach initially was because we are already seeing regressions in benchmarks caused by runtime checks for vector loops never taken. I don't want to make this problem worse for now.

Personally I'd prefer to start with a more conservative heuristic, see how it goes & iron out issues in the infrastructure.

How does that sound?

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

I believe both test cases have vectorization with runtime checks. Look for "; CHECK: vector.memcheck:"

Ah I see, thanks! I think I meant that the patch seems to lift a different but related limitation (tiny trip count vectorization with epilogue) vs this patch which deals with the runtime check threshold.

But clearly, the current cut-offs are just bogusly low.
But i also guess simply bumping them won't really solve the problem,
so i guess we need to redefine them. But what is the right metric,
especially if the trip count is not constant?
Cost of a single scalar loop iteration?

Maybe not the best approach, but taking some reasonable average across applications works well enough. Another metric to consider is the benefit from vectorization: loops with an expected 3x improvement should be more likely to be vectorized than those with 1.1x.

Agreed, for cases where the trip count is unknown we will have to still make an educated guess. It should still be better/more informed than the single number cut-off for the number of runtime checks. But as I said, I think we should start with the cases where the trip count is known, make sure it works well for that case and move on from there. This also gives us time to iron out any issues with the infrastructure.

Harbormaster completed remote builds in B119790: Diff 366739.Aug 16 2021, 2:12 PM

Sure, slow (but steady!) forward progress is better than being stuck with subpar status-quo.
I don't really have anything against the current patch as-is,
with big fat note that there are further follow-up changes needed:

drop still-present hard cut-off on the number of the checks
support variable trip count
???

In D75981#2947904, @fhahn wrote:

Rebased again after recent changes.

That's a good point, but I think that would be better as separate change, because that's a more aggressive change than replacing existing limit. IIUC that's more in line with your D71053.

You are right, I'm essentially asking to follow D71053. First of all, for the sake of making progress, I'm not going to block this change if you promise to continue working on a cost-model-driven approach.
But I personally think that it would save a lot of time if we went with a cost-model-based approach in the first place, because the most time-consuming part would be fixing performance regressions, not the implementation itself. I will leave it to you to decide :-).

The way I see it the current patch is already using a cost-model based approach: it already computes the cost of the runtime checks and the cost of the scalar loop and compares them.

The formula used in the patch initially is conservative, I think, in that it allows larger runtime checks only if their failing adds just a small overhead to the total cost of the scalar loop.

Of course we can choose other formulas, e.g. computing the cost of all vector iterations + RT checks and compare it against the cost of all scalar iterations. This is more optimistic as it assumes the runtime checks succeed.

The main reason I went for the conservative approach initially was because we are already seeing regressions in benchmarks caused by runtime checks for vector loops never taken. I don't want to make this problem worse for now.

Personally I'd prefer to start with a more conservative heuristic, see how it goes & iron out issues in the infrastructure.

How does that sound?

I'm fine with going with a more conservative heuristic. The change I would like to see is to move the runtime check cost calculation inside the cost model. This way CM.selectVectorizationFactor would return a VF with the cost of runtime checks already taken into account. It would be a bit inconvenient to "merge" with the existing limits for runtime checks though. What if we just effectively disable the current limits by putting them under an option for now and delete them entirely eventually?

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

I believe both test cases have vectorization with runtime checks. Look for "; CHECK: vector.memcheck:"

Ah I see, thanks! I think I meant that the patch seems to lift a different but related limitation (tiny trip count vectorization with epilogue) vs this patch which deals with the runtime check threshold.

I think I understand now. Indeed, for the tests to make sense we would need to allow vectorization of short trip count loops with runtime checks. I withdraw the question.

But clearly, the current cut-offs are just bogusly low.
But i also guess simply bumping them won't really solve the problem,
so i guess we need to redefine them. But what is the right metric,
especially if the trip count is not constant?
Cost of a single scalar loop iteration?

Maybe not the best approach, but taking some reasonable average across applications works well enough. Another metric to consider is the benefit from vectorization: loops with an expected 3x improvement should be more likely to be vectorized than those with 1.1x.

Agreed, for cases where the trip count is unknown we will have to still make an educated guess. It should still be better/more informed than the single number cut-off for the number of runtime checks. But as I said, I think we should start with the cases where the trip count is known, make sure it works well for that case and move on from there. This also gives us time to iron out any issues with the infrastructure.

Agree. This is unrelated to the current patch.

reverse-ping, thanks

bmahjour removed a subscriber: bmahjour.Aug 27 2021, 7:23 AM

lebedev.ri mentioned this in D109296: [LV] Improve inclusivity of vectorization.Sep 5 2021, 12:13 PM

Unfortunately I do not have the bandwidth to get this unstuck at the moment. So I'm going to strip off the heuristics change and just keep the changes that set up the last bits of the infrastructure.

Also updates getCost to use InstructionCost instead of unsigned.

Harbormaster completed remote builds in B122881: Diff 371089.Sep 7 2021, 8:32 AM

fhahn retitled this revision from [LV] Allow large RT checks, if they are a fraction of the scalar cost. to [LV] Create RT checks during planning, expose cost functions..Sep 7 2021, 8:34 AM

fhahn edited the summary of this revision.

In D75981#2986956, @fhahn wrote:

Unfortunately I do not have the bandwidth to get this unstuck at the moment. So I'm going to strip off the heuristics change and just keep the changes that set up the last bits of the infrastructure.

Sad to hear that. I can pick it up and try to push it forward if you don't mind :-) But I'll wait until D109368 has landed, since there is a fair number of dependencies. Sounds good?

The thing to keep in mind is that after D109296 (and i assume/hope it succeeds) adjusts the RT check budget allowance for variable loops,
i suspect most of the loops with constant trip count will also start getting vectorized.

ebrevnov mentioned this in D109368: [LV] Vectorize cases with larger number of RT checks, execute only if profitable..Sep 8 2021, 10:20 AM

fhahn added a child revision: D109368: [LV] Vectorize cases with larger number of RT checks, execute only if profitable..Sep 10 2021, 6:06 AM

ebrevnov mentioned this in D109443: [LV] Lazy creation of runtime checks.Sep 26 2021, 11:49 PM

ebrevnov added inline comments.Oct 1 2021, 12:14 AM

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
196	Scalar cost does not depend on VF, so this doesn't look like the best place for it. Please take a look at https://reviews.llvm.org/D109678 where I propose to cache scalar cost inside the cost model. How do you like it? Will it work for you as well?

This needs a rebase.

rebased

Harbormaster completed remote builds in B129407: Diff 380489.Oct 18 2021, 11:46 AM

rebase

Harbormaster completed remote builds in B131642: Diff 383669.Oct 31 2021, 12:14 PM

Another rebase, fixing an incorrect conflict resolution from earlier. Tests should pass again.

Harbormaster completed remote builds in B133018: Diff 385497.Nov 8 2021, 7:39 AM

another rebase :)

Harbormaster completed remote builds in B135272: Diff 388712.Nov 20 2021, 10:17 AM

rebase

Harbormaster completed remote builds in B141341: Diff 397081.Jan 3 2022, 9:20 AM

Rebase

Herald added a project: Restricted Project. Apr 5 2022, 9:42 AM

Harbormaster completed remote builds in B158019: Diff 420559.Apr 5 2022, 9:43 AM

Rebase after recent changes.

Harbormaster completed remote builds in B166237: Diff 431937.May 25 2022, 4:13 AM

Do not expose getInstructionCost, as the latest version of D109368 can simply use TTI.

Harbormaster completed remote builds in B166630: Diff 432533.May 27 2022, 5:41 AM

dmgreen accepted this revision.May 29 2022, 4:38 AM

strip unneeded changes, I plan to land this soon, after updating the description to reflect the state in the latest version

Harbormaster completed remote builds in B169815: Diff 436908.Jun 14 2022, 1:33 PM

fhahn retitled this revision from [LV] Create RT checks during planning, expose cost functions. to [LV] Create RT checks once VF/IC are selected, track scalar cost..Jun 20 2022, 5:31 AM

fhahn edited the summary of this revision.

This revision was landed with ongoing or failed builds.Jun 24 2022, 8:42 AM

Closed by commit rGcb69ba4faaf1: [LV] Create RT checks once VF/IC are selected, track scalar cost. (authored by fhahn).

This revision was automatically updated to reflect the committed changes.

fhahn added a commit: rGcb69ba4faaf1: [LV] Create RT checks once VF/IC are selected, track scalar cost..

Revision Contents

Path                                                       Size
llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h   10 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp            27 lines

Diff 439784

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h

 /// Information about vectorization costs.
 struct VectorizationFactor {
   /// Vector width with best cost.
   ElementCount Width;
   /// Cost of the loop with that width.
   InstructionCost Cost;

-  VectorizationFactor(ElementCount Width, InstructionCost Cost)
-      : Width(Width), Cost(Cost) {}
+  /// Cost of the scalar loop.
+  InstructionCost ScalarCost;
+
+  VectorizationFactor(ElementCount Width, InstructionCost Cost,
+                      InstructionCost ScalarCost)
+      : Width(Width), Cost(Cost), ScalarCost(ScalarCost) {}

   /// Width 1 means no vectorization, cost 0 means uncomputed cost.
   static VectorizationFactor Disabled() {
-    return {ElementCount::getFixed(1), 0};
+    return {ElementCount::getFixed(1), 0, 0};
   }

   bool operator==(const VectorizationFactor &rhs) const {
     return Width == rhs.Width && Cost == rhs.Cost;
   }

   bool operator!=(const VectorizationFactor &rhs) const {
     return !(*this == rhs);

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

 VectorizationFactor LoopVectorizationCostModel::selectVectorizationFactor(
     const ElementCountSet &VFCandidates) {
   InstructionCost ExpectedCost = expectedCost(ElementCount::getFixed(1)).first;
   LLVM_DEBUG(dbgs() << "LV: Scalar loop costs: " << ExpectedCost << ".\n");
   assert(ExpectedCost.isValid() && "Unexpected invalid cost for scalar loop");
   assert(VFCandidates.count(ElementCount::getFixed(1)) &&
          "Expected Scalar VF to be a candidate");

-  const VectorizationFactor ScalarCost(ElementCount::getFixed(1), ExpectedCost);
+  const VectorizationFactor ScalarCost(ElementCount::getFixed(1), ExpectedCost,
+                                       ExpectedCost);
   VectorizationFactor ChosenFactor = ScalarCost;

   bool ForceVectorization = Hints->getForce() == LoopVectorizeHints::FK_Enabled;
   if (ForceVectorization && VFCandidates.size() > 1) {
     // Ignore scalar width, because the user explicitly wants vectorization.
     // Initialize cost to max so that VF = 2 is, at least, chosen during cost
     // evaluation.
     ChosenFactor.Cost = InstructionCost::getMax();
   }

   SmallVector<InstructionVFPair> InvalidCosts;
   for (const auto &i : VFCandidates) {
     // The cost for scalar VF=1 is already calculated, so ignore it.
     if (i.isScalar())
       continue;

     VectorizationCostTy C = expectedCost(i, &InvalidCosts);
-    VectorizationFactor Candidate(i, C.first);
+    VectorizationFactor Candidate(i, C.first, ScalarCost.ScalarCost);

 #ifndef NDEBUG
     unsigned AssumedMinimumVscale = 1;
     if (Optional<unsigned> VScale = getVScaleForTuning())
       AssumedMinimumVscale = *VScale;
     unsigned Width =
         Candidate.Width.isScalable()
             ? Candidate.Width.getKnownMinValue() * AssumedMinimumVscale
[176 lines not shown]	LLVM_DEBUG(
"not a supported candidate.\n";);		"not a supported candidate.\n";);
return Result;		return Result;
}		}

if (EpilogueVectorizationForceVF > 1) {		if (EpilogueVectorizationForceVF > 1) {
LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization factor is forced.\n";);		LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization factor is forced.\n";);
ElementCount ForcedEC = ElementCount::getFixed(EpilogueVectorizationForceVF);		ElementCount ForcedEC = ElementCount::getFixed(EpilogueVectorizationForceVF);
if (LVP.hasPlanWithVF(ForcedEC))		if (LVP.hasPlanWithVF(ForcedEC))
return {ForcedEC, 0};		return {ForcedEC, 0, 0};
else {		else {
LLVM_DEBUG(		LLVM_DEBUG(
dbgs()		dbgs()
<< "LEV: Epilogue vectorization forced factor is not viable.\n";);		<< "LEV: Epilogue vectorization forced factor is not viable.\n";);
return Result;		return Result;
}		}
}		}

[1,906 lines not shown]	if (!OrigLoop->isInnermost()) {
LLVM_DEBUG(dbgs() << "LV: Using " << (!UserVF.isZero() ? "user " : "")		LLVM_DEBUG(dbgs() << "LV: Using " << (!UserVF.isZero() ? "user " : "")
<< "VF " << VF << " to build VPlans.\n");		<< "VF " << VF << " to build VPlans.\n");
buildVPlans(VF, VF);		buildVPlans(VF, VF);

// For VPlan build stress testing, we bail out after VPlan construction.		// For VPlan build stress testing, we bail out after VPlan construction.
if (VPlanBuildStressTest)		if (VPlanBuildStressTest)
return VectorizationFactor::Disabled();		return VectorizationFactor::Disabled();

return {VF, 0 /*Cost*/};		return {VF, 0 /*Cost*/, 0 /* ScalarCost */};
}		}

LLVM_DEBUG(		LLVM_DEBUG(
dbgs() << "LV: Not vectorizing. Inner loops aren't supported in the "		dbgs() << "LV: Not vectorizing. Inner loops aren't supported in the "
"VPlan-native path.\n");		"VPlan-native path.\n");
return VectorizationFactor::Disabled();		return VectorizationFactor::Disabled();
}		}

[34 lines not shown]	assert(isPowerOf2_32(UserVF.getKnownMinValue()) &&
"VF needs to be a power of two");		"VF needs to be a power of two");
// Collect the instructions (and their associated costs) that will be more		// Collect the instructions (and their associated costs) that will be more
// profitable to scalarize.		// profitable to scalarize.
if (CM.selectUserVectorizationFactor(UserVF)) {		if (CM.selectUserVectorizationFactor(UserVF)) {
LLVM_DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");		LLVM_DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");
CM.collectInLoopReductions();		CM.collectInLoopReductions();
buildVPlansWithVPRecipes(UserVF, UserVF);		buildVPlansWithVPRecipes(UserVF, UserVF);
LLVM_DEBUG(printPlans(dbgs()));		LLVM_DEBUG(printPlans(dbgs()));
return {{UserVF, 0}};		return {{UserVF, 0, 0}};
} else		} else
reportVectorizationInfo("UserVF ignored because of invalid costs.",		reportVectorizationInfo("UserVF ignored because of invalid costs.",
"InvalidCost", ORE, OrigLoop);		"InvalidCost", ORE, OrigLoop);
}		}

// Populate the set of Vectorization Factor Candidates.		// Populate the set of Vectorization Factor Candidates.
ElementCountSet VFCandidates;		ElementCountSet VFCandidates;
for (auto VF = ElementCount::getFixed(1);		for (auto VF = ElementCount::getFixed(1);
[2,911 lines not shown]	#endif /* NDEBUG */
unsigned UserIC = Hints.getInterleave();		unsigned UserIC = Hints.getInterleave();

// Plan how to best vectorize, return the best VF and its cost.		// Plan how to best vectorize, return the best VF and its cost.
Optional<VectorizationFactor> MaybeVF = LVP.plan(UserVF, UserIC);		Optional<VectorizationFactor> MaybeVF = LVP.plan(UserVF, UserIC);

VectorizationFactor VF = VectorizationFactor::Disabled();		VectorizationFactor VF = VectorizationFactor::Disabled();
unsigned IC = 1;		unsigned IC = 1;

		GeneratedRTChecks Checks(*PSE.getSE(), DT, LI,
		F->getParent()->getDataLayout());
if (MaybeVF) {		if (MaybeVF) {
if (LVP.requiresTooManyRuntimeChecks()) {		if (LVP.requiresTooManyRuntimeChecks()) {
ORE->emit([&]() {		ORE->emit([&]() {
return OptimizationRemarkAnalysisAliasing(		return OptimizationRemarkAnalysisAliasing(
DEBUG_TYPE, "CantReorderMemOps", L->getStartLoc(),		DEBUG_TYPE, "CantReorderMemOps", L->getStartLoc(),
L->getHeader())		L->getHeader())
<< "loop not vectorized: cannot prove it is safe to reorder "		<< "loop not vectorized: cannot prove it is safe to reorder "
"memory operations";		"memory operations";
});		});
LLVM_DEBUG(dbgs() << "LV: Too many memory checks needed.\n");		LLVM_DEBUG(dbgs() << "LV: Too many memory checks needed.\n");
Hints.emitRemarkWithHints();		Hints.emitRemarkWithHints();
return false;		return false;
}		}
VF = *MaybeVF;		VF = *MaybeVF;
// Select the interleave count.		// Select the interleave count.
IC = CM.selectInterleaveCount(VF.Width, *VF.Cost.getValue());		IC = CM.selectInterleaveCount(VF.Width, *VF.Cost.getValue());

		unsigned SelectedIC = std::max(IC, UserIC);
		// Optimistically generate runtime checks if they are needed. Drop them if
		// they turn out to not be profitable.
		if (VF.Width.isVector() || SelectedIC > 1)
		Checks.Create(L, *LVL.getLAI(), PSE.getPredicate(), VF.Width, SelectedIC);
}		}

// Identify the diagnostic messages that should be produced.		// Identify the diagnostic messages that should be produced.
std::pair<StringRef, std::string> VecDiagMsg, IntDiagMsg;		std::pair<StringRef, std::string> VecDiagMsg, IntDiagMsg;
bool VectorizeLoop = true, InterleaveLoop = true;		bool VectorizeLoop = true, InterleaveLoop = true;
if (VF.Width.isScalar()) {		if (VF.Width.isScalar()) {
LLVM_DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n");		LLVM_DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n");
VecDiagMsg = std::make_pair(		VecDiagMsg = std::make_pair(
"VectorizationNotBeneficial",		"VectorizationNotBeneficial",
		lebedev.ri: Please make 0.005 an option
"the cost-model indicates that vectorization is not beneficial");		"the cost-model indicates that vectorization is not beneficial");
VectorizeLoop = false;		VectorizeLoop = false;
}		}

if (!MaybeVF && UserIC > 1) {		if (!MaybeVF && UserIC > 1) {
// Tell the user interleaving was avoided up-front, despite being explicitly		// Tell the user interleaving was avoided up-front, despite being explicitly
// requested.		// requested.
LLVM_DEBUG(dbgs() << "LV: Ignoring UserIC, because vectorization and "		LLVM_DEBUG(dbgs() << "LV: Ignoring UserIC, because vectorization and "
[62 lines not shown]	if (!VectorizeLoop && !InterleaveLoop) {
LLVM_DEBUG(dbgs() << "LV: Found a vectorizable loop (" << VF.Width		LLVM_DEBUG(dbgs() << "LV: Found a vectorizable loop (" << VF.Width
<< ") in " << DebugLocStr << '\n');		<< ") in " << DebugLocStr << '\n');
LLVM_DEBUG(dbgs() << "LV: Interleave Count is " << IC << '\n');		LLVM_DEBUG(dbgs() << "LV: Interleave Count is " << IC << '\n');
}		}

bool DisableRuntimeUnroll = false;		bool DisableRuntimeUnroll = false;
MDNode *OrigLoopID = L->getLoopID();		MDNode *OrigLoopID = L->getLoopID();
{		{
// Optimistically generate runtime checks. Drop them if they turn out to not
// be profitable. Limit the scope of Checks, so the cleanup happens
// immediately after vector codegeneration is done.
GeneratedRTChecks Checks(*PSE.getSE(), DT, LI,
F->getParent()->getDataLayout());
if (!VF.Width.isScalar() || IC > 1)
Checks.Create(L, *LVL.getLAI(), PSE.getPredicate(), VF.Width, IC);

using namespace ore;		using namespace ore;
if (!VectorizeLoop) {		if (!VectorizeLoop) {
assert(IC > 1 && "interleave count should not be 1 or 0");		assert(IC > 1 && "interleave count should not be 1 or 0");
// If we decided that it is not legal to vectorize the loop, then		// If we decided that it is not legal to vectorize the loop, then
// interleave it.		// interleave it.
InnerLoopUnroller Unroller(L, PSE, LI, DT, TLI, TTI, AC, ORE, IC, &LVL,		InnerLoopUnroller Unroller(L, PSE, LI, DT, TLI, TTI, AC, ORE, IC, &LVL,
&CM, BFI, PSI, Checks);		&CM, BFI, PSI, Checks);

[236 lines not shown]
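The summary says the patch lets LV decide whether to vectorize with runtime checks by weighing their cost against the vector loop. The code that makes that decision is in the hidden lines, so the following is only an illustrative sketch of one way such a comparison could look, not the heuristic used in the patch. The names `SketchCosts` and `worthVectorizingWithChecks`, the use of a known trip count, and the cost units are all assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical inputs, all in abstract cost-model units.
struct SketchCosts {
  uint64_t ScalarCost;  // cost of one scalar iteration
  uint64_t VectorCost;  // cost of one vector iteration (covers Width elements)
  unsigned Width;       // selected vectorization factor
  uint64_t RTCheckCost; // one-off cost of the generated runtime checks
};

// Assumed model: the checks run once, the vector loop handles full groups of
// Width iterations, and leftover iterations run in the scalar loop.
// Returns true if that total is cheaper than running everything scalar.
bool worthVectorizingWithChecks(const SketchCosts &C, uint64_t TripCount) {
  assert(C.Width > 1 && "expected a vector width");
  uint64_t ScalarTotal = C.ScalarCost * TripCount;
  uint64_t VectorIters = TripCount / C.Width;
  uint64_t Remainder = TripCount % C.Width;
  uint64_t VectorTotal =
      C.RTCheckCost + C.VectorCost * VectorIters + C.ScalarCost * Remainder;
  return VectorTotal < ScalarTotal;
}
```

Under this model, short loops lose to the fixed cost of the checks while long loops amortize it, which is why carrying the scalar cost next to the vector cost is useful once the checks have been generated and can be costed.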


Revision Contents

Diff 439784

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
