This is an archive of the discontinued LLVM Phabricator instance.

[LV] Create RT checks once VF/IC are selected, track scalar cost.
ClosedPublic

Authored by fhahn on Mar 11 2020, 4:05 AM.

Details

Summary

This patch updates LV to generate runtime checks after the VF & IC are selected. It
allows deciding whether to vectorize with runtime checks or not based on
their cost compared to the vector loop.

It also updates VectorizationFactor to include the scalar cost.
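
For context, the planner's result type now carries the scalar cost next to the vector cost. A minimal sketch of its shape (not the exact upstream definition; the field comments are mine):

```cpp
#include "llvm/Support/InstructionCost.h"
#include "llvm/Support/TypeSize.h"

namespace llvm {
// Sketch of the planner's result after this patch; the real definition lives
// in LoopVectorizationPlanner.h and may carry additional members.
struct VectorizationFactor {
  ElementCount Width;         // selected vectorization factor
  InstructionCost Cost;       // expected cost of the vector loop
  InstructionCost ScalarCost; // cost of the scalar loop, kept so later
                              // decisions (e.g. runtime-check profitability)
                              // can compare against it
};
} // namespace llvm
```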

Diff Detail

Event Timeline

fhahn created this revision.Mar 11 2020, 4:05 AM
Herald added a project: Restricted Project. Mar 11 2020, 4:05 AM
fhahn edited the summary of this revision. Mar 11 2020, 4:42 AM
lebedev.ri added inline comments.Mar 11 2020, 3:19 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10349

Please make 0.005 an option

Reverse-ping, thanks.
Anything to get this going?

fhahn updated this revision to Diff 282932.Aug 4 2020, 8:43 AM

rebase.

With the linked dependent patches, this should now successfully build test-suite with MultiSource/SPEC2000/SPEC2006.

This leads to additional vectorization with runtime checks in a few more cases:

Same hash: 223 (filtered out)
Remaining: 14
Metric: loop-vectorize.LoopsVectorized

Program                                      patch1  patch2   diff
test-suite...Source/Benchmarks/sim/sim.test    5.00    8.00  60.0%
test-suite...rks/FreeBench/pifft/pifft.test   33.00   47.00  42.4%
test-suite...chmarks/Rodinia/srad/srad.test    3.00    4.00  33.3%
test-suite...CFP2000/177.mesa/177.mesa.test  379.00  417.00  10.0%
test-suite...CI_Purple/SMG2000/smg2000.test   78.00   84.00   7.7%
test-suite...pps-C/SimpleMOC/SimpleMOC.test   39.00   42.00   7.7%
test-suite...oxyApps-C/miniGMG/miniGMG.test   42.00   44.00   4.8%
test-suite.../CINT2000/176.gcc/176.gcc.test   97.00  100.00   3.1%
test-suite...006/450.soplex/450.soplex.test   88.00   90.00   2.3%
test-suite...lications/ClamAV/clamscan.test   91.00   93.00   2.2%
test-suite...pplications/oggenc/oggenc.test  130.00  132.00   1.5%
test-suite...006/447.dealII/447.dealII.test  958.00  970.00   1.3%

fhahn updated this revision to Diff 314337.Jan 4 2021, 2:12 AM

Rebased on top of current trunk. This version can now build MultiSource/SPEC2006/SPEC2000 with -O3 -flto without crashing.

The current version leads to a few more vectorized loops in some benchmarks:

Tests: 236
Same hash: 200 (filtered out)
Remaining: 36
Metric: loop-vectorize.LoopsVectorized

Program                                        base   patch.lv-mem-cost diff
 test-suite...Source/Benchmarks/sim/sim.test     5.00   8.00            60.0%
 test-suite...chmarks/Rodinia/srad/srad.test     3.00   4.00            33.3%
 test-suite...rks/FreeBench/pifft/pifft.test    33.00  43.00            30.3%
 test-suite...CFP2000/177.mesa/177.mesa.test   386.00 424.00             9.8%
 test-suite...CI_Purple/SMG2000/smg2000.test    77.00  83.00             7.8%
 test-suite...pps-C/SimpleMOC/SimpleMOC.test    39.00  42.00             7.7%
 test-suite...oxyApps-C/miniGMG/miniGMG.test    44.00  46.00             4.5%
 test-suite.../CINT2000/176.gcc/176.gcc.test    99.00 102.00             3.0%
 test-suite...006/450.soplex/450.soplex.test    88.00  90.00             2.3%
 test-suite...lications/ClamAV/clamscan.test    97.00  99.00             2.1%
 test-suite...pplications/oggenc/oggenc.test   151.00 153.00             1.3%
 test-suite...006/447.dealII/447.dealII.test   970.00 982.00             1.2%
rkruppe removed a subscriber: rkruppe.Jan 4 2021, 2:53 AM
fhahn updated this revision to Diff 315353.Jan 8 2021, 5:00 AM

rebase on top of the recent changes.

fhahn updated this revision to Diff 324996.Feb 19 2021, 8:39 AM

Rebased after recent changes to D75980

fhahn updated this revision to Diff 327404.Mar 2 2021, 3:28 AM

Ping.

All dependent patches have been submitted now. I also pre-committed a simplified version of the test cases in 0cb9d8acbccb.

The update also adds an option to control the threshold.
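
A threshold like this is typically exposed with cl::opt; a minimal sketch of what such an option could look like (the flag name and description are illustrative, not necessarily what the patch uses):

```cpp
#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Hypothetical flag; the 0.005 default mirrors the constant discussed above.
static cl::opt<double> RTCheckCostFraction(
    "lv-runtime-check-cost-fraction", cl::init(0.005), cl::Hidden,
    cl::desc("Maximum fraction of the scalar loop's total cost that the "
             "runtime checks are allowed to add"));
```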

xbolva00 added inline comments.Mar 2 2021, 3:42 AM
llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-size-based-threshold.ll
9 ↗(On Diff #327404)

is too small?

fhahn updated this revision to Diff 327407.Mar 2 2021, 3:48 AM

Fix wording in test comments, thanks!

You'd better remove the (WIP) suffix if this patch is ready for review. Otherwise people may think it is still in progress...

lebedev.ri added inline comments.Mar 4 2021, 3:51 AM
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
269–270 ↗(On Diff #327407)

Hm, why does PragmaThresholdReached not check for Hints.allowReordering()?
If I really asked for the loop to be vectorized, why are there further limits on the sanity checks?
Or, for that matter, doesn't forcing vectorization disable those checks in the first place?

I'd like to understand why you chose to go the way you did instead of taking the cost of runtime checks into account in the cost model itself. To me, the cost model is the right place for that. I would expect to see something similar to what I did in https://reviews.llvm.org/D71053 for LoopVectorizationPlanner::mayDisregardRTChecksOverhead

fhahn updated this revision to Diff 330679.Mar 15 2021, 8:48 AM

I'd like to understand why you chose to go the way you did instead of taking the cost of runtime checks into account in the cost model itself. To me, the cost model is the right place for that. I would expect to see something similar to what I did in https://reviews.llvm.org/D71053 for LoopVectorizationPlanner::mayDisregardRTChecksOverhead

That's a good point. I initially tried to keep things modeled as closely as possible on the original code. I think the way the number of runtime checks is handled in LoopVectorizationRequirements is not ideal and makes things more difficult to follow. I am also not sure why those checks are handled separately (in doesNotMeet). I tried to see if we can remove doesNotMeet and instead move the checks to an earlier and more appropriate place.

I put up D98634 and D98633 to remove doesNotMeet, which moved the decision whether to vectorize with RT checks to LVP::plan(). I also updated this patch to make the cost-based decision in LVP::plan. From there it should be easy to adjust it further to use it for more cost-based decisions, as in your patch. Is this more in line with what you had in mind?

fhahn added inline comments.Mar 15 2021, 8:54 AM
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
269–270 ↗(On Diff #327407)

I can't really comment much on why the existing code does what it does. But it is indeed surprising that runtime checks can block vectorization if a width is explicitly set. I put up D98634 and D98633 to move the doesNotMeet restrictions to what seem like more appropriate places to me. The updated code also skips the checks if an explicit VF is forced by the user.

I updated the code in this patch to always use the cost-based check if a constant TC is available here.

fhahn retitled this revision from [LV] Allow large RT checks, if they are a fraction of the scalar cost (WIP) to [LV] Allow large RT checks, if they are a fraction of the scalar cost..Mar 29 2021, 9:31 AM
fhahn updated this revision to Diff 333908.Mar 29 2021, 9:32 AM

Rebase & ping.

All dependent patches have been submitted and the decision is now made during planning.

lebedev.ri accepted this revision.Mar 29 2021, 9:39 AM

LGTM.
As with other patches, i'm not the best reviewer for this, but clearly others are otherwise preoccupied.

This isn't the final form, but I think this part is a reasonably safe step.
Things can and will be adjusted further later.

This revision is now accepted and ready to land.Mar 29 2021, 9:39 AM

I don't really understand why we need this separate heuristic for runtime checks. Why don't we simply add cost of runtime checks (possibly with some small scaling to be safe) to total cost of vector loop and just use existing cost model to decide?

Please add regression tests from https://reviews.llvm.org/D71053 to this change set as well.

Reverse ping, thanks.

@fhahn Reverse ping, thanks :)

Reverse ping, thanks.

@fhahn Reverse ping, thanks :)

@fhahn Reverse ping, thanks :)

Reverse ping, thanks.

@fhahn Reverse ping, thanks :)

@fhahn Reverse ping, thanks :)

@fhahn Reverse ping, thanks :)

Reverse ping, thanks.

@fhahn Reverse ping, thanks :)

@fhahn Reverse ping, thanks :)

@fhahn Reverse ping, thanks :)

Reverse ping, thanks.

reverse-ping, thanks.

Futile weekly reverse-ping, thanks (:

Reverse ping, thanks.

fhahn updated this revision to Diff 354555.Jun 25 2021, 10:59 AM

I finally had some time to rebase this change and fix the fallout.

I don't really understand why we need this separate heuristic for runtime checks. Why don't we simply add cost of runtime checks (possibly with some small scaling to be safe) to total cost of vector loop and just use existing cost model to decide?

That's a good point, but I think that would be better as separate change, because that's a more aggressive change than replacing existing limit. IIUC that's more in line with your D71053.

Please add regression tests from https://reviews.llvm.org/D71053 to this change set as well.

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

Bump :)

I finally had some time to rebase this change and fix the fallout.

Hurray!
It would be good to finally have this resolved.

I don't really understand why we need this separate heuristic for runtime checks. Why don't we simply add cost of runtime checks (possibly with some small scaling to be safe) to total cost of vector loop and just use existing cost model to decide?

That's a good point, but I think that would be better as separate change, because that's a more aggressive change than replacing existing limit. IIUC that's more in line with your D71053.

Please add regression tests from https://reviews.llvm.org/D71053 to this change set as well.

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

reverse ping, thanks

ebrevnov added a comment.EditedJul 26 2021, 1:39 AM

Sorry for long silence. Got into hospital with COVID-19 for almost a month.

I don't really understand why we need this separate heuristic for runtime checks. Why don't we simply add cost of runtime checks (possibly with some small scaling to be safe) to total cost of vector loop and just use existing cost model to decide?

That's a good point, but I think that would be better as separate change, because that's a more aggressive change than replacing existing limit. IIUC that's more in line with your D71053.

You are right, I'm essentially asking to follow D71053. First of all, for the sake of making progress I'm not going to block this change if you promise to continue working on the cost-model-driven approach.
But I personally think it would save a lot of time if we went with the cost-model-based approach in the first place, because the most time-consuming part would be fixing performance regressions, not the implementation itself. I will leave it to you to decide :-).

Please add regression tests from https://reviews.llvm.org/D71053 to this change set as well.

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

I believe both test cases have vectorization with runtime checks. Look for "; CHECK: vector.memcheck:"

It is *really* sad to see these two patches to be stuck :(

Sorry for long silence. Got into hospital with COVID-19 for almost a month.

I don't really understand why we need this separate heuristic for runtime checks. Why don't we simply add cost of runtime checks (possibly with some small scaling to be safe) to total cost of vector loop and just use existing cost model to decide?

That's a good point, but I think that would be better as separate change, because that's a more aggressive change than replacing existing limit. IIUC that's more in line with your D71053.

You are right, I'm essentially asking to follow D71053. First of all, for the sake of making progress I'm not going to block this change if you promise to continue working on the cost-model-driven approach.
But I personally think it would save a lot of time if we went with the cost-model-based approach in the first place, because the most time-consuming part would be fixing performance regressions, not the implementation itself. I will leave it to you to decide :-).

Please add regression tests from https://reviews.llvm.org/D71053 to this change set as well.

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

I believe both test cases have vectorization with runtime checks. Look for "; CHECK: vector.memcheck:"

I think the goals of these two patches are largely correlated.

I think the main problem is that it isn't quite obvious why hard cut-offs on the runtime check complexity exist.
I guess, to not generate some very large and ridiculous checks.
But clearly, the current cut-offs are just bogusly low.
But i also guess simply bumping them won't really solve the problem,
so i guess we need to redefine them. But what is the right metric,
especially if the trip count is not constant?
Cost of a single scalar loop iteration?

Reverse ping, thanks.

I think the main problem is that it isn't quite obvious why hard cut-offs on the runtime check complexity exist.
I guess, to not generate some very large and ridiculous checks.

I believe those cut-offs exist for a single reason: there was no way to calculate the "real" cost of SCEV-generated instructions. Now there is support for that, and we can/should simply take the cost of runtime checks into account in the cost model.
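
A rough sketch of how the cost of already-expanded runtime checks can be summed up via TTI (the helper name and surrounding plumbing are illustrative; the exact TTI entry point has also varied across LLVM versions, e.g. getUserCost in older trees):

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"
using namespace llvm;

// Sum the throughput cost of every instruction emitted for the runtime
// checks (e.g. the SCEV-expanded bound computations and the compares).
static InstructionCost costOfRTChecks(ArrayRef<BasicBlock *> CheckBlocks,
                                      const TargetTransformInfo &TTI) {
  InstructionCost Cost = 0;
  for (BasicBlock *BB : CheckBlocks)
    for (Instruction &I : *BB)
      Cost += TTI.getInstructionCost(&I,
                                     TargetTransformInfo::TCK_RecipThroughput);
  return Cost;
}
```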

But clearly, the current cut-offs are just bogusly low.
But i also guess simply bumping them won't really solve the problem,
so i guess we need to redefine them. But what is the right metric,
especially if the trip count is not constant?
Cost of a single scalar loop iteration?

Maybe not the best approach, but taking some reasonable average across applications works well enough. Another metric to consider is the benefit from vectorization. Thus loops with an expected 3x improvement should be more likely to be vectorized than ones with 1.1x.

Honestly in the end i'm not sure i know which approach is best, i just want to see this finally fixed :/

fhahn updated this revision to Diff 366739.Aug 16 2021, 2:11 PM

Rebased again after recent changes.

That's a good point, but I think that would be better as separate change, because that's a more aggressive change than replacing existing limit. IIUC that's more in line with your D71053.

You are right, I'm essentially asking to follow D71053. First of all, for the sake of making progress I'm not going to block this change if you promise to continue working on the cost-model-driven approach.
But I personally think it would save a lot of time if we went with the cost-model-based approach in the first place, because the most time-consuming part would be fixing performance regressions, not the implementation itself. I will leave it to you to decide :-).

The way I see it the current patch is already using a cost-model based approach: it already computes the cost of the runtime checks and the cost of the scalar loop and compares them.

The formula used in the patch initially is conservative, I think, in that it allows larger runtime checks only if their failing adds only a small overhead to the total cost of the scalar loop.

Of course we can choose other formulas, e.g. computing the cost of all vector iterations + RT checks and compare it against the cost of all scalar iterations. This is more optimistic as it assumes the runtime checks succeed.
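
To make the difference concrete, here is a sketch of the two rules (all names are illustrative; the costs come from the LV cost model, TC is assumed to be a known constant, and the scalar epilogue is ignored for simplicity):

```cpp
#include <cstdint>

// Conservative rule: emitting the checks is acceptable if, even when they
// fail at runtime and we fall back to the scalar loop, they only add a tiny
// fraction (Threshold, e.g. 0.005) of the scalar loop's total cost.
// Optimistic rule: assume the checks pass and compare end-to-end costs of
// the vectorized and scalar versions.
bool allowRTChecks(uint64_t ScalarIterCost, uint64_t VectorIterCost,
                   uint64_t RTCheckCost, uint64_t TC, unsigned VF,
                   double Threshold, bool Conservative) {
  if (Conservative)
    return RTCheckCost < ScalarIterCost * TC * Threshold;
  return RTCheckCost + VectorIterCost * (TC / VF) < ScalarIterCost * TC;
}
```

The conservative rule bounds the worst case (checks fail, scalar loop runs anyway); the optimistic one maximizes expected gain but risks paying for checks that never succeed.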

The main reason I went for the conservative approach initially was because we are already seeing regressions in benchmarks caused by runtime checks for vector loops never taken. I don't want to make this problem worse for now.

Personally I'd prefer to start with a more conservative heuristic, see how it goes & iron out issues in the infrastructure.

How does that sound?

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

I believe both test cases have vectorization with runtime checks. Look for "; CHECK: vector.memcheck:"

Ah I see, thanks! I think I meant that the patch in D71053 seems to lift a different but related limitation (tiny-trip-count vectorization with an epilogue), whereas this patch deals with the runtime check threshold.

But clearly, the current cut-offs are just bogusly low.
But i also guess simply bumping them won't really solve the problem,
so i guess we need to redefine them. But what is the right metric,
especially if the trip count is not constant?
Cost of a single scalar loop iteration?

Maybe not the best approach, but taking some reasonable average across applications works well enough. Another metric to consider is the benefit from vectorization. Thus loops with an expected 3x improvement should be more likely to be vectorized than ones with 1.1x.

Agreed, for cases where the trip count is unknown we will have to still make an educated guess. It should still be better/more informed than the single number cut-off for the number of runtime checks. But as I said, I think we should start with the cases where the trip count is known, make sure it works well for that case and move on from there. This also gives us time to iron out any issues with the infrastructure.
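
One way to make that educated guess, sketched under the assumption that a profile-based estimate is an acceptable fallback (the helper name and fallback behaviour are illustrative):

```cpp
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Transforms/Utils/LoopUtils.h"
using namespace llvm;

// Prefer an exact constant trip count from SCEV; otherwise fall back to a
// profile-based estimate. Returning 0 signals "unknown", in which case the
// caller would keep the existing hard cut-off on the number of checks.
static unsigned guessTripCount(Loop *L, ScalarEvolution &SE) {
  if (unsigned ConstTC = SE.getSmallConstantTripCount(L))
    return ConstTC;
  if (auto EstimatedTC = getLoopEstimatedTripCount(L))
    return *EstimatedTC;
  return 0;
}
```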

Sure, slow (but steady!) forward progress is better than being stuck with subpar status-quo.
I don't really have anything against the current patch as-is,
with big fat note that there are further follow-up changes needed:

  • drop still-present hard cut-off on the number of the checks
  • support variable trip count
  • ???

Rebased again after recent changes.

That's a good point, but I think that would be better as separate change, because that's a more aggressive change than replacing existing limit. IIUC that's more in line with your D71053.

You are right, I'm essentially asking to follow D71053. First of all, for the sake of making progress I'm not going to block this change if you promise to continue working on the cost-model-driven approach.
But I personally think it would save a lot of time if we went with the cost-model-based approach in the first place, because the most time-consuming part would be fixing performance regressions, not the implementation itself. I will leave it to you to decide :-).

The way I see it the current patch is already using a cost-model based approach: it already computes the cost of the runtime checks and the cost of the scalar loop and compares them.

The formula used in the patch initially is conservative, I think, in that it allows larger runtime checks only if their failing adds only a small overhead to the total cost of the scalar loop.

Of course we can choose other formulas, e.g. computing the cost of all vector iterations + RT checks and compare it against the cost of all scalar iterations. This is more optimistic as it assumes the runtime checks succeed.

The main reason I went for the conservative approach initially was because we are already seeing regressions in benchmarks caused by runtime checks for vector loops never taken. I don't want to make this problem worse for now.

Personally I'd prefer to start with a more conservative heuristic, see how it goes & iron out issues in the infrastructure.

How does that sound?

I'm fine going with the more conservative heuristic. The change I would like to see is to move the runtime-check cost calculation inside the cost model. This way CM.selectVectorizationFactor would return a VF with the cost of runtime checks already taken into account. It would be a bit inconvenient to "merge" this with the existing limits for runtime checks though. What if we just effectively disable the current limits by putting them under an option for now and delete them entirely eventually?

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

I believe both test cases have vectorization with runtime checks. Look for "; CHECK: vector.memcheck:"

Ah I see, thanks! I think I meant that the patch in D71053 seems to lift a different but related limitation (tiny-trip-count vectorization with an epilogue), whereas this patch deals with the runtime check threshold.

I think I understand now. Indeed, for the tests to make sense we would need to allow vectorization of short trip count loops with runtime checks. Consider the question withdrawn.

But clearly, the current cut-offs are just bogusly low.
But i also guess simply bumping them won't really solve the problem,
so i guess we need to redefine them. But what is the right metric,
especially if the trip count is not constant?
Cost of a single scalar loop iteration?

Maybe not the best approach, but taking some reasonable average across applications works well enough. Another metric to consider is the benefit from vectorization. Thus loops with an expected 3x improvement should be more likely to be vectorized than ones with 1.1x.

Agreed, for cases where the trip count is unknown we will have to still make an educated guess. It should still be better/more informed than the single number cut-off for the number of runtime checks. But as I said, I think we should start with the cases where the trip count is known, make sure it works well for that case and move on from there. This also gives us time to iron out any issues with the infrastructure.

Agree. This is unrelated to the current patch.

reverse-ping, thanks

bmahjour removed a subscriber: bmahjour.Aug 27 2021, 7:23 AM
fhahn updated this revision to Diff 371089.Sep 7 2021, 8:31 AM

Unfortunately I do not have the bandwidth to get this unstuck at the moment. So I'm going to strip off the heuristics change and just keep the changes that set up the last bits of the infrastructure.

Also updates getCost to use InstructionCost instead of unsigned.

fhahn retitled this revision from [LV] Allow large RT checks, if they are a fraction of the scalar cost. to [LV] Create RT checks during planning, expose cost functions..Sep 7 2021, 8:34 AM
fhahn edited the summary of this revision.

Unfortunately I do not have the bandwidth to get this unstuck at the moment. So I'm going to strip off the heuristics change and just keep the changes that set up the last bits of the infrastructure.

Sad to hear that. I can pick it up and try to push it forward if you don't mind :-) But I'll wait until D109368 has landed since there is a fair number of dependencies. Sounds good?

Unfortunately I do not have the bandwidth to get this unstuck at the moment. So I'm going to strip off the heuristics change and just keep the changes that set up the last bits of the infrastructure.

Sad to hear that. I can pick it up and try to push it forward if you don't mind :-) But I'll wait until D109368 has landed since there is a fair number of dependencies. Sounds good?

The thing to keep in mind is that after D109296 (and I assume/hope it succeeds) adjusts the RT check budget allowance for loops with a variable trip count,
I suspect most of the loops with a constant trip count will also start getting vectorized.

ebrevnov added inline comments.Oct 1 2021, 12:14 AM
llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
193

The scalar cost is not VF-dependent, so this doesn't look like the best place for it. Please take a look at https://reviews.llvm.org/D109678 where I propose to cache the scalar cost inside the cost model. How do you like it? Will it work for you as well?

This needs a rebase.

fhahn updated this revision to Diff 385497.Nov 8 2021, 7:38 AM

Another rebase, fixing an incorrect conflict resolution from earlier. Tests should pass again.

fhahn updated this revision to Diff 388712.Nov 20 2021, 10:16 AM

another rebase :)

Herald added a project: Restricted Project. Apr 5 2022, 9:42 AM
fhahn updated this revision to Diff 431937.May 25 2022, 4:13 AM

Rebase after recent changes.

fhahn updated this revision to Diff 432533.May 27 2022, 5:41 AM

Do not expose getInstructionCost, as the latest version of D109368 can simply use TTI.

dmgreen accepted this revision.May 29 2022, 4:38 AM
fhahn updated this revision to Diff 436908.Jun 14 2022, 1:32 PM

Strip unneeded changes. I plan to land this soon, after updating the description to reflect the state of the latest version.

fhahn retitled this revision from [LV] Create RT checks during planning, expose cost functions. to [LV] Create RT checks once VF/IC are selected, track scalar cost..Jun 20 2022, 5:31 AM
fhahn edited the summary of this revision.
This revision was landed with ongoing or failed builds.Jun 24 2022, 8:42 AM
This revision was automatically updated to reflect the committed changes.