Page MenuHomePhabricator

[LV] Allow large RT checks, if they are a fraction of the scalar cost.
AcceptedPublic

Authored by fhahn on Mar 11 2020, 4:05 AM.

Details

Summary

This patch estimates the cost of the generated runtime checks and
relaxes the limit on the number of runtime checks, if the cost of the
runtime checks is a small fraction (0.5% of the expected scalar loop
runtime). The threshold (and other details) are not set in stone yet
and requires further benchmarking/analysis. Also, ExpectedTC returns
the max of the induction variable for loops without known constant trip
counts, which means we largely overestimate the cost of loops with
variable trip counts.

The current version also keeps a hard limit of 2 *
NumRuntimePointerChecks, but that also needs a better look.

If the general direction is agreed upon, I will hash out the final
details.

Fixes PR44662 (modulo potential adjustments for unknown trip counts)

Diff Detail

Event Timeline

fhahn created this revision.Mar 11 2020, 4:05 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 11 2020, 4:05 AM
fhahn edited the summary of this revision. (Show Details)Mar 11 2020, 4:42 AM
lebedev.ri added inline comments.Mar 11 2020, 3:19 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9283

Please make 0.005 an option

Reverse-ping, thanks.
Anything to get this going?

fhahn updated this revision to Diff 282932.Aug 4 2020, 8:43 AM

rebase.

With the linked dependent patches, this should now successfully build test-suite with MultiSource/SPEC2000/SPEC2006.

This leads to additional vectorization with runtime checks in a few more cases:

Same hash: 223 (filtered out)
Remaining: 14
Metric: loop-vectorize.LoopsVectorized

Program patch1 patch2 diff
test-suite...Source/Benchmarks/sim/sim.test 5.00 8.00 60.0%
test-suite...rks/FreeBench/pifft/pifft.test 33.00 47.00 42.4%
test-suite...chmarks/Rodinia/srad/srad.test 3.00 4.00 33.3%
test-suite...CFP2000/177.mesa/177.mesa.test 379.00 417.00 10.0%
test-suite...CI_Purple/SMG2000/smg2000.test 78.00 84.00 7.7%
test-suite...pps-C/SimpleMOC/SimpleMOC.test 39.00 42.00 7.7%
test-suite...oxyApps-C/miniGMG/miniGMG.test 42.00 44.00 4.8%
test-suite.../CINT2000/176.gcc/176.gcc.test 97.00 100.00 3.1%
test-suite...006/450.soplex/450.soplex.test 88.00 90.00 2.3%
test-suite...lications/ClamAV/clamscan.test 91.00 93.00 2.2%
test-suite...pplications/oggenc/oggenc.test 130.00 132.00 1.5%
test-suite...006/447.dealII/447.dealII.test 958.00 970.00 1.3%

fhahn updated this revision to Diff 314337.Jan 4 2021, 2:12 AM

Rebased on top of current trunk. This version now can build MultiSource/SPEC2006/SPEC2000 with -O3 -flto without crashing.

The current version leads to a few more vectorized loops in some benchmarks:

Tests: 236
Same hash: 200 (filtered out)
Remaining: 36
Metric: loop-vectorize.LoopsVectorized

Program                                        base   patch.lv-mem-cost diff
 test-suite...Source/Benchmarks/sim/sim.test     5.00   8.00            60.0%
 test-suite...chmarks/Rodinia/srad/srad.test     3.00   4.00            33.3%
 test-suite...rks/FreeBench/pifft/pifft.test    33.00  43.00            30.3%
 test-suite...CFP2000/177.mesa/177.mesa.test   386.00 424.00             9.8%
 test-suite...CI_Purple/SMG2000/smg2000.test    77.00  83.00             7.8%
 test-suite...pps-C/SimpleMOC/SimpleMOC.test    39.00  42.00             7.7%
 test-suite...oxyApps-C/miniGMG/miniGMG.test    44.00  46.00             4.5%
 test-suite.../CINT2000/176.gcc/176.gcc.test    99.00 102.00             3.0%
 test-suite...006/450.soplex/450.soplex.test    88.00  90.00             2.3%
 test-suite...lications/ClamAV/clamscan.test    97.00  99.00             2.1%
 test-suite...pplications/oggenc/oggenc.test   151.00 153.00             1.3%
 test-suite...006/447.dealII/447.dealII.test   970.00 982.00             1.2%
rkruppe removed a subscriber: rkruppe.Jan 4 2021, 2:53 AM
fhahn updated this revision to Diff 315353.Jan 8 2021, 5:00 AM

rebase on top of the recent changes.

fhahn updated this revision to Diff 324996.Feb 19 2021, 8:39 AM

Rebased after recent changes to D75980

fhahn updated this revision to Diff 327404.Mar 2 2021, 3:28 AM

Ping.

All dependent patches have been submitted now. I also pre-committed a simplified version of the test cases in 0cb9d8acbccb.

The update also adds an option to control the threshold.

xbolva00 added inline comments.Mar 2 2021, 3:42 AM
llvm/test/Transforms/LoopVectorize/AArch64/runtime-check-size-based-threshold.ll
9

is too small?

fhahn updated this revision to Diff 327407.Mar 2 2021, 3:48 AM

Fix wording in test comments, thanks!

You better remove (WIP) suffix if this patch is ready for review. Otherwise people may think it still in progress...

lebedev.ri added inline comments.Mar 4 2021, 3:51 AM
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
269–270

Hm, why does PragmaThresholdReached not check for Hints.allowReordering()?
If i really asked for loop to be vectorized, why are there further limits on the sanity checks?
Or, for that matter, doesn't forcing vectorization disable those checks in the first place?

I'd like to understand why you chose to go the way it is instead of taking cost of runtime checks into account in the cost model itself? To me, the cost model is the right place for that. I would expect to see something similar to what I did in https://reviews.llvm.org/D71053 for LoopVectorizationPlanner::mayDisregardRTChecksOverhead

fhahn updated this revision to Diff 330679.Mar 15 2021, 8:48 AM

I'd like to understand why you chose to go the way it is instead of taking cost of runtime checks into account in the cost model itself? To me, the cost model is the right place for that. I would expect to see something similar to what I did in https://reviews.llvm.org/D71053 for LoopVectorizationPlanner::mayDisregardRTChecksOverhead

That's a good point. I initially tried to keep things as closely modeled to the original code. I think the way the number of runtime checks is handled in LoopVectorizationRequirements is not ideal and makes things more difficult to follow. I am also not sure why those checks are handled separately (in doesNotMeet). I tried to see if we can remove doesNotMeet and instead move the checks at an earlier and more appropriate place.

I put up D98634 and D98633 to remove doesNotMeet, which moved the decision whether to vectorize with RT checks to LVP::plan(). I also updated this patch to make the cost-based decision in LVP::plan. From there it should be easy to adjust it further to use it for more cost-based decisions, as in your patch. Is this more in line what you had in mind?

fhahn added inline comments.Mar 15 2021, 8:54 AM
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
269–270

I can't really comment much on why the existing code does what it does. But it is indeed surprising that runtime checks can block vectorization if a width is explicitly set. I put up D98634 and D98633 to move the doesNotMeet restrictions to what seems more appropriate places to me. The updated code also skips the checks if an explicit VF is forced by the user.

I updated the code in this patch to always use the cost-based check if a constant TC is available here.

fhahn retitled this revision from [LV] Allow large RT checks, if they are a fraction of the scalar cost (WIP) to [LV] Allow large RT checks, if they are a fraction of the scalar cost..Mar 29 2021, 9:31 AM
fhahn updated this revision to Diff 333908.Mar 29 2021, 9:32 AM

Rebase & ping.

All dependent patches have been submitted and the decision is now made during planning.

lebedev.ri accepted this revision.Mar 29 2021, 9:39 AM

LGTM.
As with other patches, i'm not the best reviewer for this, but clearly others are otherwise preoccupied.

This isn't the final form, i think this part is reasonably safe step.
Things can and will be adjusted further later.

This revision is now accepted and ready to land.Mar 29 2021, 9:39 AM

I don't really understand why we need this separate heuristic for runtime checks. Why don't we simply add cost of runtime checks (possibly with some small scaling to be safe) to total cost of vector loop and just use existing cost model to decide?

Please add regression tests from https://reviews.llvm.org/D71053 to this change set as well.

Reverse ping, thanks.

@fhahn Reverse ping, thanks :)

Reverse ping, thanks.

@fhahn Reverse ping, thanks :)

@fhahn Reverse ping, thanks :)

Reverse ping, thanks.

@fhahn Reverse ping, thanks :)

@fhahn Reverse ping, thanks :)

@fhahn Reverse ping, thanks :)

Reverse ping, thanks.

@fhahn Reverse ping, thanks :)

@fhahn Reverse ping, thanks :)

@fhahn Reverse ping, thanks :)

Reverse ping, thanks.

reverse-ping, thanks.

Futile weekly reverse-ping, thanks (:

Reverse ping, thanks.

fhahn updated this revision to Diff 354555.Jun 25 2021, 10:59 AM

I finally had some time to rebase this change and fix the fallout.

I don't really understand why we need this separate heuristic for runtime checks. Why don't we simply add cost of runtime checks (possibly with some small scaling to be safe) to total cost of vector loop and just use existing cost model to decide?

That's a good point, but I think that would be better as separate change, because that's a more aggressive change than replacing existing limit. IIUC that's more in line with your D71053.

Please add regression tests from https://reviews.llvm.org/D71053 to this change set as well.

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

Bump :)

I finally had some time to rebase this change and fix the fallout.

Hurray!
It would be good to finally have this resolved.

I don't really understand why we need this separate heuristic for runtime checks. Why don't we simply add cost of runtime checks (possibly with some small scaling to be safe) to total cost of vector loop and just use existing cost model to decide?

That's a good point, but I think that would be better as separate change, because that's a more aggressive change than replacing existing limit. IIUC that's more in line with your D71053.

Please add regression tests from https://reviews.llvm.org/D71053 to this change set as well.

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

reverse ping, thanks

ebrevnov added a comment.EditedMon, Jul 26, 1:39 AM

Sorry for long silence. Got into hospital with COVID-19 for almost a month.

I don't really understand why we need this separate heuristic for runtime checks. Why don't we simply add cost of runtime checks (possibly with some small scaling to be safe) to total cost of vector loop and just use existing cost model to decide?

That's a good point, but I think that would be better as separate change, because that's a more aggressive change than replacing existing limit. IIUC that's more in line with your D71053.

You are right, I'm essentially asking to follow D71053. First of all, in sake of doing progress I'm not going to block this change if you promise continue working on cost model driven approach.
But I personally think that it would save a lot of time if we go with cost model based approach in the first place because most time consuming thing would be fixing performance regressions and not the implementation itself. I will leave it on you to decide :-).

Please add regression tests from https://reviews.llvm.org/D71053 to this change set as well.

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

I believe both test cases have vectorization with runtime checks. Look for "; CHECK: vector.memcheck:"

It is *really* sad to see these two patches to be stuck :(

Sorry for long silence. Got into hospital with COVID-19 for almost a month.

I don't really understand why we need this separate heuristic for runtime checks. Why don't we simply add cost of runtime checks (possibly with some small scaling to be safe) to total cost of vector loop and just use existing cost model to decide?

That's a good point, but I think that would be better as separate change, because that's a more aggressive change than replacing existing limit. IIUC that's more in line with your D71053.

You are right, I'm essentially asking to follow D71053. First of all, in sake of doing progress I'm not going to block this change if you promise continue working on cost model driven approach.
But I personally think that it would save a lot of time if we go with cost model based approach in the first place because most time consuming thing would be fixing performance regressions and not the implementation itself. I will leave it on you to decide :-).

Please add regression tests from https://reviews.llvm.org/D71053 to this change set as well.

I'm not sure that there's much benefit at the moment, because there will be no changes. The focus of those tests seems to be more about vectorizing small trip count loops with an epilogue and not the cost of memory runtime checks (there are no memory runtime checks for the test I think)

I believe both test cases have vectorization with runtime checks. Look for "; CHECK: vector.memcheck:"

I think the goals of these two patches are largely correlated.

I think the main problem is that it isn't quite obvious why hard cut-offs on the runtime check complexity exist.
I guess, to not generate some very large and ridiculous checks.
But clearly, the current cut-offs are just bogusly low.
But i also guess simply bumping them won't really solve the problem,
so i guess we need to redefine them. But what is the right metric,
especially if the trip count is not constant?
Cost of a single scalar loop iteration?