
[LV] Improve inclusivity of vectorization
Needs Review · Public

Authored by lebedev.ri on Sep 5 2021, 12:13 PM.

Details

Summary

Right now, LoopVectorizer has a hard limit on the number of runtime memory checks.
The limit is currently 8, and while it generally works reasonably well,
it is, like all arbitrary limits, ultimately arbitrary.

There are several problems with it:

  1. It puts a hard cap on the complexity of the loops it will vectorize. Naturally, the more pointer arithmetic/"objects" you have, the more checks are needed
  2. The number of runtime memory checks doesn't actually correlate with the overhead incurred by them. I've checked locally, and a single check can have a cost from 4 to 25...
  3. Why do we have this hard limit anyways? I guess because we want to avoid generating too many checks?
  4. How did we come up with the current limit?

Therefore, I would like to propose completely changing the approach here,
and instead specifying the budget for said checks in terms of multiples of the cost
of a single iteration of the original scalar loop.

That is, if the cost of a single iteration of the original scalar loop is 10,
and the multiple is 2, then the budget for the runtime checks is 10*2 = 20.
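The proposed rule is easy to state in code. Below is a minimal sketch, not the patch's actual implementation; the helper name is made up, and the factor value of 2 is just the worked example above:

```cpp
#include <cassert>

// Hypothetical knob mirroring the patch's VectorizeMemoryCheckFactor
// cl::opt; the value 2 matches the worked example above.
constexpr unsigned VectorizeMemoryCheckFactor = 2;

// Allow vectorization with runtime checks only while their total cost
// stays within Factor * (cost of one scalar loop iteration).
inline bool runtimeChecksFitBudget(unsigned RTCheckCost,
                                   unsigned ScalarIterationCost) {
  return RTCheckCost <= VectorizeMemoryCheckFactor * ScalarIterationCost;
}
```

With a scalar iteration cost of 10 the budget is 20, so a check cost of 20 is accepted and 21 is rejected.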

So far I have looked for the optimal value of this threshold on RawSpeed and darktable,
and the results are interesting:
https://docs.google.com/spreadsheets/d/1b3VPU1tPYGq0AO3XH3kBv3zdpKMby8aJzFl2cLSZ5AQ/edit?usp=sharing
Just to preserve all the existing vectorizations, we'd need to allow the cost of the run-time checks
to be no greater than the cost of 6 iterations of the scalar loop.

I know pretty much all of the code there should vectorize, because I (re)wrote most of it.
Originally, it was just manually vectorized with SSE2, but I've added plain fallbacks.

This is motivated by the bug report https://bugs.llvm.org/show_bug.cgi?id=44662, which I filed
almost two years ago now. The code is inspired by/based on the code by @fhahn in D75981,
but unfortunately that patch is rather stuck, and the vectorization area of LLVM appears to be
a walled garden without many outside-of-the-club contributions, with the latter being busy,
so I don't have much hope here :S

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 8a0999ddb98c..f4495cba57f5 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -8186,6 +8186,12 @@ LoopVectorizationPlanner::plan(ElementCount UserVF, unsigned UserIC,
   // Check if it is profitable to vectorize with runtime checks.
   if (SelectedVF.Width.getKnownMinValue() > 1 &&
       Requirements.getNumRuntimePointerChecks()) {
+    errs() << "LV LAA num " << Requirements.getNumRuntimePointerChecks()
+           << " RTCost " << Checks.getCost(CM) << " ScalarLoopCost "
+           << SelectedVF.ScalarCost.getValue().getValue() << " fraction "
+           << (double)Checks.getCost(CM) /
+                  SelectedVF.ScalarCost.getValue().getValue()
+           << "\n";
     if (Checks.getCost(CM) >
         VectorizeMemoryCheckFactor * (*SelectedVF.ScalarCost.getValue())) {
       ORE->emit([&]() {

Diff Detail

Unit Tests: Failed

Time    Test
420 ms  x64 debian > Clang.Frontend::optimization-remark-options.c
Script: -- : 'RUN: at line 2'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/clang -O1 -fvectorize -target x86_64-unknown-unknown -Rpass-analysis=loop-vectorize -emit-llvm -S /var/lib/buildkite-agent/builds/llvm-project/clang/test/Frontend/optimization-remark-options.c -o - 2>&1 | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/clang/test/Frontend/optimization-remark-options.c
810 ms  x64 debian > libomp.lock::omp_init_lock.c
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/./bin/clang -fopenmp -pthread -fno-experimental-isel -I /var/lib/buildkite-agent/builds/llvm-project/openmp/runtime/test -I /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/runtime/src -L /var/lib/buildkite-agent/builds/llvm-project/build/lib -I /var/lib/buildkite-agent/builds/llvm-project/openmp/runtime/test/ompt /var/lib/buildkite-agent/builds/llvm-project/openmp/runtime/test/lock/omp_init_lock.c -o /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/runtime/test/lock/Output/omp_init_lock.c.tmp -lm -latomic && /var/lib/buildkite-agent/builds/llvm-project/build/projects/openmp/runtime/test/lock/Output/omp_init_lock.c.tmp
490 ms  x64 windows > Clang.Frontend::optimization-remark-options.c
Script: -- : 'RUN: at line 2'; c:\ws\w3\llvm-project\premerge-checks\build\bin\clang.exe -O1 -fvectorize -target x86_64-unknown-unknown -Rpass-analysis=loop-vectorize -emit-llvm -S C:\ws\w3\llvm-project\premerge-checks\clang\test\Frontend\optimization-remark-options.c -o - 2>&1 | c:\ws\w3\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w3\llvm-project\premerge-checks\clang\test\Frontend\optimization-remark-options.c

Event Timeline

lebedev.ri created this revision. Sep 5 2021, 12:13 PM
lebedev.ri requested review of this revision. Sep 5 2021, 12:13 PM
nikic added a comment. Sep 5 2021, 1:14 PM

Shouldn't this be taking the trip count into account somehow? Generating 7 times the scalar loop in run-time checks seems okay if you have thousands of iterations, but would be bad for a cold loop with a few iterations. After a quick look at D75981 that variant does look at the TC. From the patch description, it's not entirely clear to me why a fixed factor of the scalar loop cost is preferable.

(Though as a comment on D75981, it uses getSmallBestKnownTC, which may return getSmallConstantMaxTripCount, which I believe will commonly just be INT_MAX for loops that are finite but have unknown trip count, so it may be drastically overestimating the TC for loops without exact trip count or profile data.)

Thank you for taking a look!

Shouldn't this be taking the trip count into account somehow? Generating 7 times the scalar loop in run-time checks seems okay if you have thousands of iterations, but would be bad for a cold loop with a few iterations. After a quick look at D75981 that variant does look at the TC. From the patch description, it's not entirely clear to me why a fixed factor of the scalar loop cost is preferable.

(Though as a comment on D75981, it uses getSmallBestKnownTC, which may return getSmallConstantMaxTripCount, which I believe will commonly just be INT_MAX for loops that are finite but have unknown trip count, so it may be drastically overestimating the TC for loops without exact trip count or profile data.)

You answered your own question in your last paragraph.
I'm fundamentally interested in variable trip count loops,
where we don't know the actual constant backedge-taken count,
so we simply can't take backedge-taken count into account here.

I have figured out what was bothering me about counting checks - we are actually counting check *groups*,
where each group intentionally doesn't necessarily contain only a single pointer, so the current design
doesn't account for the min/max reduction of pointers within a group.
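To illustrate the point, here is a toy model of the grouping, not LLVM's actual RuntimeCheckingPtrGroup API: a group's bounds are the min/max over its member pointers, so the per-member reduction work is invisible to a metric that only counts groups.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-in for the address range accessed via one member pointer.
struct PtrRange {
  uint64_t Low;  // first byte accessed
  uint64_t High; // one past the last byte accessed
};

// A group covers [min of Lows, max of Highs) over all members; computing
// it is a min/max reduction whose cost grows with the member count.
PtrRange groupBounds(const std::vector<PtrRange> &Members) {
  PtrRange G{UINT64_MAX, 0};
  for (const PtrRange &M : Members) {
    G.Low = std::min(G.Low, M.Low);
    G.High = std::max(G.High, M.High);
  }
  return G;
}

// The emitted runtime check per pair of groups: do the ranges overlap?
bool mayConflict(const PtrRange &A, const PtrRange &B) {
  return A.Low < B.High && B.Low < A.High;
}
```

A group with 45 members and a group with 1 member both count as "one check", even though the former needs 45 min/max steps before the single overlap compare.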

Collected some more numbers (sheet updated) (+ rawtherapee, babl/gegl, vanilla llvm test-suite).
I think it showcases the problem quite well. We currently are okay with vectorizing
if that means emitting a check with checks=8,members=45,RTCost=146,
but not with checks=9,members=18,RTCost=36.
I think the latter check is obviously cheaper than the former one :)
Those are real-world occurrences, not synthetic numbers.
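The inversion is easy to demonstrate with those two real cases; in this sketch the group limit of 8 is the current default, while the cost budget of 100 is a hypothetical value, picked only to separate the two cases:

```cpp
#include <cassert>

// The two real-world cases quoted above, as (group count, total member
// count, modelled runtime-check cost).
struct Checks {
  unsigned NumGroups;
  unsigned NumMembers;
  unsigned RTCost;
};

// Current policy: accept while the number of check *groups* is <= 8.
bool countPolicy(const Checks &C) { return C.NumGroups <= 8; }

// Hypothetical cost-based policy with an arbitrary budget of 100.
bool costPolicy(const Checks &C) { return C.RTCost <= 100; }

constexpr Checks Expensive{8, 45, 146}; // accepted today, yet costly
constexpr Checks Cheap{9, 18, 36};      // rejected today, yet cheap
```

The count-based policy accepts the expensive case and rejects the cheap one; any cost-based budget between 36 and 146 does the opposite.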

One important epiphany occurred to me: whatever formula we come up with,
it should not contain VF or the cost of the vectorized loop body,
because we'll pick the threshold for vectorization using e.g. VF=8/AVX2,
but that will naturally result in a smaller budget for checks
when LV picks e.g. VF=4/SSE4.2 as the best strategy.

As it stands right now, I think we roughly have the following possible solutions:

  1. "It ain't broken, let's not fix it." I think it's pretty obviously broken.
  2. Instead of counting the number of groups, count the total number of members. We could do that, but it is not guaranteed to correlate with the final cost of the checks (see RuntimeCheckingPtrGroup::addPointer()); I don't think this is the right fix.
  3. Just hardcode the budget. Might be a better approach than what we do now. I think then it should be ~160.
  4. My original proposal. Specify the budget in terms of the number of scalar loop iterations. I think then it should be ~12.
  5. ??? Does anyone have a better idea?
Matt added a subscriber: Matt. Sep 6 2021, 5:42 PM
fhahn added a comment. Sep 7 2021, 8:48 AM

Unfortunately I do not have the bandwidth or energy to get D75981 unstuck, as there are a few competing requests and my original plan was to fix the case where we have most information to make the best decision first (trip count available) and have people look into heuristics for their cases. However, most of the required infrastructure changes already landed and I updated D75981 to just contain the remaining non-heuristic changes.

Regardless of which heuristics we choose, I think we should not generate runtime checks if we can prove that RT + vector loop is more expensive than the scalar loop. I put up D109368 to guard against that.

Instead of counting the number of groups, count the total number of members. We could do that, but it is not guaranteed to correlate with the final cost of the checks (see RuntimeCheckingPtrGroup::addPointer()); I don't think this is the right fix.

I don't think that's a good idea, as it does not consider the cost of other runtime checks (like those generated for SCEV predicates required for the memory checks). If we want to make better cost based decisions, we should definitely include all aspects of the runtime checks we generate.

Just hardcode the budget. Might be a better approach than what we do now. I think then it should be ~160.

As in a cost of 160 in terms of InstructionCost?

My original proposal. Specify the budget in terms of the number of scalar loop iterations. I think then it should be ~12.

Do you mean the cost of 12 scalar loop iterations?

Unfortunately I do not have the bandwidth or energy to get D75981 unstuck, as there are a few competing requests and my original plan was to fix the case where we have most information to make the best decision first (trip count available) and have people look into heuristics for their cases. However, most of the required infrastructure changes already landed and I updated D75981 to just contain the remaining non-heuristic changes.

Sure, I can understand that. Thank you!

Regardless of which heuristics we choose, I think we should not generate runtime checks if we can prove that RT + vector loop is more expensive than the scalar loop. I put up D109368 to guard against that.

You mean if we know the constant trip count, or there is guard info that specifies the maximal trip count (but *NOT* PGO data)?
Sounds reasonable to me, I think.

Instead of counting the number of groups, count the total number of members. We could do that, but it is not guaranteed to correlate with the final cost of the checks (see RuntimeCheckingPtrGroup::addPointer()); I don't think this is the right fix.

I don't think that's a good idea, as it does not consider the cost of other runtime checks (like those generated for SCEV predicates required for the memory checks). If we want to make better cost based decisions, we should definitely include all aspects of the runtime checks we generate.

Yep, I remarked as much; this approach doesn't seem reasonable.

Just hardcode the budget. Might be a better approach than what we do now. I think then it should be ~160.

As in a cost of 160 in terms of InstructionCost?

Yep.

My original proposal. Specify the budget in terms of the number of scalar loop iterations. I think then it should be ~12.

Do you mean the cost of 12 scalar loop iterations?

Yep.

bmahjour removed a subscriber: bmahjour. Sep 7 2021, 9:15 AM

Hi @Ayal, @dorit!
The road forward here isn't mapped, so I would *love* to hear some thoughts
on the general direction of this patch, and on the steps needed to move this forward.

There is a certain overlap with @ebrevnov's patches; while I'm obviously biased,
I would say those changes would look more natural building on top of this change.

My original proposal. Specify the budget in terms of the number of scalar loop iterations. I think then it should be ~12.

Do you mean the cost of 12 scalar loop iterations?

Yep.

In the description you mentioned that ~6 is needed to match the current threshold, so 12 would be very roughly double the threshold IIUC? With 12, do all expected loops get vectorized for RawSpeed and that's a motivation for 12? I assume all loops in RawSpeed have sufficiently large trip counts (like >100 or > 1000)?

I've been wondering if it might be worthwhile to give users an easier way to tell the compiler it should assume high trip counts. That might make our life easier for projects/files where this applies, and could also help with other loop optimizations which only become profitable for larger trip counts (IIRC we had several reports by users where this could be helpful for more than just vectorization).

Collected some more numbers (sheet updated) (+ rawtherapee, babl/geg/, vanilla llvm test-suite).
I think it showcases the problem quite well. We currently are okay with vectorizing
if that means emitting a check with checks=8,members=45,RTCost=146,
but not with checks=9,members=18,RTCost=36.

That's an interesting finding! Originally replacing the old threshold with a cost-of-all-checks one did not seem very appealing to me. But if it would be possible to come up with a reasonable translation of the old one (not based on the cases where the cost currently is very much overestimated but some middle-ground), it might be a viable first step. But then it probably would be easier to just transition once and deal with the fallout.

My original proposal. Specify the budget in terms of the number of scalar loop iterations. I think then it should be ~12.

Do you mean the cost of 12 scalar loop iterations?

Yep.

In the description you mentioned that ~6 is needed to match the current threshold, so 12 would be very roughly double the threshold IIUC?
With 12, do all expected loops get vectorized for RawSpeed and that's a motivation for 12?

See the spreadsheet linked in the description, RTCost/ScalarCostPerIter is the new threshold, NumChecks is the old "threshold".
With ~6, almost all loops in RawSpeed+darktable get vectorized (well, ignoring those LV doesn't get to / doesn't know how to vectorize).
But obviously this threshold is going to vary somewhat per codebase;
as you can see in the spreadsheet, a somewhat higher limit is beneficial for the other codebases I looked at.

I assume all loops in RawSpeed have sufficiently large trip counts (like >100 or > 1000)?

It will obviously vary on a case-by-case basis, but yes; in the particular unvectorized loops that originally motivated this,
it's ~10M (see VC5Decompressor::Wavelet::reconstructPass()::process() and VC5Decompressor::Wavelet::combineLowHighPass()::process())


... for a single input (https://raw.pixls.us/data-unique/GoPro/HERO6%20Black/GOPR9172.GPR)
For darktable loops I'd say the trip count is not smaller than 1M, averaging maybe around ~25M+, or higher depending on lack of unrolling.

I've been wondering if it might be worth to give the users an easier way to tell the compiler it should assume high trip counts. That might make our life easier for projects/files where this applies and could also help with other loop optimizations which only become profitable for larger trip counts (IIRC we had several reports by users where this could be helpful for not only vectorization)

FWIW, I'm personally doing this because I'd like these things to just work without any pragmas.
Perhaps PGO counters could be useful for that.

Collected some more numbers (sheet updated) (+ rawtherapee, babl/geg/, vanilla llvm test-suite).
I think it showcases the problem quite well. We currently are okay with vectorizing
if that means emitting a check with checks=8,members=45,RTCost=146,
but not with checks=9,members=18,RTCost=36.

That's an interesting finding! Originally replacing the old threshold with a cost-of-all-checks one did not seem very appealing to me. But if it would be possible to come up with a reasonable translation of the old one (not based on the cases where the cost currently is very much overestimated but some middle-ground), it might be a viable first step. But then it probably would be easier to just transition once and deal with the fallout.

The obvious problem is, whichever new limit we choose, as long as it no longer vectorizes *some* cases,
some of the cases we no longer vectorize were actually profitable to vectorize (i.e. the run-time check passing at runtime).

Hi @Ayal, @dorit!
The road forward here isn't mapped, so i would *love* to hear some thoughts
on the general direction of this patch, and on the steps needed to move this forward.

There is a certain overlap with @ebrevnov's patches; while i'm obviously biased,
i would say those changes would look more natural building ontop of this change.

I believe these changes are quite independent. There are potential code conflicts (a purely technical thing) and maybe some cases where one affects the other, but logically they are mostly orthogonal. Today the cost model underestimates the cost of the vector version because it doesn't account for the cost of RT checks and the epilog loop. My patches try to fix that; it should help stop doing "definitely" unprofitable vectorization. This patch (and the original version of D75981) introduces/replaces a heuristic which tries to limit the potential overhead (btw, should we take probability info into account as well?) when/if the vector version is skipped due to failed runtime checks. At least this is how I see it.

My original proposal. Specify the budget in terms of the number of scalar loop iterations. I think then it should be ~12.

Do you mean the cost of 12 scalar loop iterations?

Yep.

In the description you mentioned that ~6 is needed to match the current threshold, so 12 would be very roughly double the threshold IIUC?
With 12, do all expected loops get vectorized for RawSpeed and that's a motivation for 12?

See the spreadsheet linked in the description, RTCost/ScalarCostPerIter is the new threshold, NumChecks is the old "threshold".
With ~6 almost all loops in RawSpeed+darktable get vectorized (well, ignoring those LV doesn't get to/know how to vectorize)
But obviously this threshold is going to vary somewhat per codebase,
as you can see in the spreadsheet, a somewhat higher limit is beneficial for other codebases i looked at.

I assume all loops in RawSpeed have sufficiently large trip counts (like >100 or > 1000)?

It will obviously depend on case-by-case basis, but yes; in the particular unvectorized loops that originally motivated
it's ~10M: (see VC5Decompressor::Wavelet::reconstructPass()::process() and VC5Decompressor::Wavelet::combineLowHighPass()::process())


... for a single input (https://raw.pixls.us/data-unique/GoPro/HERO6%20Black/GOPR9172.GPR)
For darktable loops i'd say the trip count is not smaller than 1M, averaging maybe around ~25M+, or higher depending on lack of unrolling.

I've been wondering if it might be worth to give the users an easier way to tell the compiler it should assume high trip counts. That might make our life easier for projects/files where this applies and could also help with other loop optimizations which only become profitable for larger trip counts (IIRC we had several reports by users where this could be helpful for not only vectorization)

FWIW i'm personally doing this because i'd like these things to just work without any pragmas.
Perhaps PGO counters could be useful for that.

I wasn't really thinking of a pragma (although it might be helpful in some cases), but a new compiler option (like `-fhigh-trip-count-assumption`).
PGO should definitely be helpful here.

Collected some more numbers (sheet updated) (+ rawtherapee, babl/geg/, vanilla llvm test-suite).
I think it showcases the problem quite well. We currently are okay with vectorizing
if that means emitting a check with checks=8,members=45,RTCost=146,
but not with checks=9,members=18,RTCost=36.

That's an interesting finding! Originally replacing the old threshold with a cost-of-all-checks one did not seem very appealing to me. But if it would be possible to come up with a reasonable translation of the old one (not based on the cases where the cost currently is very much overestimated but some middle-ground), it might be a viable first step. But then it probably would be easier to just transition once and deal with the fallout.

The obvious problem is, whichever new limit we choose, as long as it no longer vectorizes *some* cases,
some of the cases we no longer vectorize were actually profitable to vectorize (i.e. the run-time check passing at runtime).

There's another option that potentially allows us to actually side-step the issue of not knowing the trip count. Based on the formula used in the original version of D109368, we can compute the minimum trip-count required for the vector loop to be profitable. We can also compute a minimum trip count so that the cost of the runtime-check is only a fraction of the total scalar loop cost. We already emit a minimum iteration check which can be adjusted with the additional computed minimums. I think that would allow us to vectorize a lot more aggressively, while still guarding against runtime checks adding a large overhead if they fail for low trip count loops. I updated D109368 accordingly.
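A sketch of that guard, as described (illustrative names and a made-up cost-fraction denominator, not D109368's actual code): require the runtime-check cost to be at most 1/Denom of the scalar loop's total cost, and solve for the smallest trip count that satisfies it.

```cpp
#include <cassert>

// Require: RTCost <= TC * ScalarIterCost / Denom
// =>       TC    >= Denom * RTCost / ScalarIterCost   (rounded up)
// The result can be folded into the minimum-iterations check that
// already guards entry to the vector loop.
unsigned minTripCountForChecks(unsigned RTCost, unsigned ScalarIterCost,
                               unsigned Denom) {
  unsigned Needed = Denom * RTCost;
  return (Needed + ScalarIterCost - 1) / ScalarIterCost; // ceiling divide
}
```

For example, taking the checks=9,members=18,RTCost=36 case mentioned earlier with an assumed scalar iteration cost of 6 and Denom=2 gives a minimum trip count of 12; the RTCost=146 case needs 49 iterations under the same assumptions.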

My original proposal. Specify the budget in terms of the number of scalar loop iterations. I think then it should be ~12.

Do you mean the cost of 12 scalar loop iterations?

Yep.

In the description you mentioned that ~6 is needed to match the current threshold, so 12 would be very roughly double the threshold IIUC?
With 12, do all expected loops get vectorized for RawSpeed and that's a motivation for 12?

See the spreadsheet linked in the description, RTCost/ScalarCostPerIter is the new threshold, NumChecks is the old "threshold".
With ~6 almost all loops in RawSpeed+darktable get vectorized (well, ignoring those LV doesn't get to/know how to vectorize)
But obviously this threshold is going to vary somewhat per codebase,
as you can see in the spreadsheet, a somewhat higher limit is beneficial for other codebases i looked at.

I assume all loops in RawSpeed have sufficiently large trip counts (like >100 or > 1000)?

It will obviously depend on case-by-case basis, but yes; in the particular unvectorized loops that originally motivated
it's ~10M: (see VC5Decompressor::Wavelet::reconstructPass()::process() and VC5Decompressor::Wavelet::combineLowHighPass()::process())


... for a single input (https://raw.pixls.us/data-unique/GoPro/HERO6%20Black/GOPR9172.GPR)
For darktable loops i'd say the trip count is not smaller than 1M, averaging maybe around ~25M+, or higher depending on lack of unrolling.

I've been wondering if it might be worth to give the users an easier way to tell the compiler it should assume high trip counts. That might make our life easier for projects/files where this applies and could also help with other loop optimizations which only become profitable for larger trip counts (IIRC we had several reports by users where this could be helpful for not only vectorization)

FWIW i'm personally doing this because i'd like these things to just work without any pragmas.
Perhaps PGO counters could be useful for that.

I wasn't really thinking of a pragma (although it might be helpful in some cases), but a new compiler option (like `-fhigh-trip-count-assumption`).
PGO should definitely be helpful here.

Hmm, sounds interesting, but it is still more complex than "just works out of the box" (:

Collected some more numbers (sheet updated) (+ rawtherapee, babl/geg/, vanilla llvm test-suite).
I think it showcases the problem quite well. We currently are okay with vectorizing
if that means emitting a check with checks=8,members=45,RTCost=146,
but not with checks=9,members=18,RTCost=36.

That's an interesting finding! Originally replacing the old threshold with a cost-of-all-checks one did not seem very appealing to me. But if it would be possible to come up with a reasonable translation of the old one (not based on the cases where the cost currently is very much overestimated but some middle-ground), it might be a viable first step. But then it probably would be easier to just transition once and deal with the fallout.

The obvious problem is, whichever new limit we choose, as long as it no longer vectorizes *some* cases,
some of the cases we no longer vectorize were actually profitable to vectorize (i.e. the run-time check passing at runtime).

There's another option that potentially allows us to actually side-step the issue of not knowing the trip count. Based on the formula used in the original version of D109368, we can compute the minimum trip-count required for the vector loop to be profitable. We can also compute a minimum trip count so that the cost of the runtime-check is only a fraction of the total scalar loop cost. We already emit a minimum iteration check which can be adjusted with the additional computed minimums. I think that would allow us to vectorize a lot more aggressively, while still guarding against runtime checks adding a large overhead if they fail for low trip count loops. I updated D109368 accordingly.

Ok, please check if I got this right: instead of having a hard compile-time cut-off for the checks, which we believe is used to guard against compile-time (and file-size) explosion,
we completely drop this limit and always vectorize, but before running the run-time checks, we perform the trip-count check, and if it fails, we fall back to the scalar loop?
That way we incur compile-time cost and file-size bloat, but not run-time cost.
If I got the idea right, that sounds rather too good to be true :) I really like it :)

There's another option that potentially allows us to actually side-step the issue of not knowing the trip count. Based on the formula used in the original version of D109368, we can compute the minimum trip-count required for the vector loop to be profitable. We can also compute a minimum trip count so that the cost of the runtime-check is only a fraction of the total scalar loop cost. We already emit a minimum iteration check which can be adjusted with the additional computed minimums. I think that would allow us to vectorize a lot more aggressively, while still guarding against runtime checks adding a large overhead if they fail for low trip count loops. I updated D109368 accordingly.

Ok, please check if i got this right: instead of having a hard compile-time cut-off for the checks, which we believe is used to guard against compile-time (and file-size) explosion,
we completely drop this limit, always vectorize, but before doing the run-time checks, we perform the trip-count checks, and if it fails, we fallback to scalar loop?

Yes, that should be it basically (we also skip vectorization *if* we already know that the expected trip count is less than the computed minimums; this should also include profile info). This approach was the result of an offline discussion with @Ayal.

That way we incur compile-time cost, filesize bloat, but not run-time cost.

Yep, but I don't think compile-time cost and code-size increases are much to worry about, and they are not the original motivation for the cutoff; the runtime-check limit only prevents vectorization of 1% of otherwise-vectorized loops in SPEC2006/SPEC2017/MultiSource with -O3. And when optimizing for size we currently do not allow runtime checks anyway.

There's another option that potentially allows us to actually side-step the issue of not knowing the trip count. Based on the formula used in the original version of D109368, we can compute the minimum trip-count required for the vector loop to be profitable. We can also compute a minimum trip count so that the cost of the runtime-check is only a fraction of the total scalar loop cost. We already emit a minimum iteration check which can be adjusted with the additional computed minimums. I think that would allow us to vectorize a lot more aggressively, while still guarding against runtime checks adding a large overhead if they fail for low trip count loops. I updated D109368 accordingly.

Ok, please check if i got this right: instead of having a hard compile-time cut-off for the checks, which we believe is used to guard against compile-time (and file-size) explosion,
we completely drop this limit, always vectorize, but before doing the run-time checks, we perform the trip-count checks, and if it fails, we fallback to scalar loop?

Yes, that should be it basically (we also skip vectorization *if* we already know that the expected trip count is less than the computed minimums; this should also include profile info). This approach was the result of an offline discussion with @Ayal.

This sounds truly awesome.

That way we incur compile-time cost, filesize bloat, but not run-time cost.

Yep, but I don't think compile-time cost and code size increases are much to worry about and are not the original motivation for the cutoff; too many runtime checks only prevents vectorization of 1% of otherwise vectorized loops in SPEC2006/SPEC2017/MultiSource with -O3. And when optimizing for size we currently do not allow runtime checks anyways.

Well okay then :)
So I guess what I need to do is rebase this patch on top of D109368, and simply methodically exterminate! exterminate! the compile-time limits instead of redesigning them, correct?

fhahn added a comment. Sat, Sep 25, 1:16 PM

There's another option that potentially allows us to actually side-step the issue of not knowing the trip count. Based on the formula used in the original version of D109368, we can compute the minimum trip-count required for the vector loop to be profitable. We can also compute a minimum trip count so that the cost of the runtime-check is only a fraction of the total scalar loop cost. We already emit a minimum iteration check which can be adjusted with the additional computed minimums. I think that would allow us to vectorize a lot more aggressively, while still guarding against runtime checks adding a large overhead if they fail for low trip count loops. I updated D109368 accordingly.

Ok, please check if i got this right: instead of having a hard compile-time cut-off for the checks, which we believe is used to guard against compile-time (and file-size) explosion,
we completely drop this limit, always vectorize, but before doing the run-time checks, we perform the trip-count checks, and if it fails, we fallback to scalar loop?

Yes, that should be it basically (we also skip vectorization *if* we already know that the expected trip count is less than the computed minimums; this should also include profile info). This approach was the result of an offline discussion with @Ayal.

This sounds truly awesome.

That way we incur compile-time cost, filesize bloat, but not run-time cost.

Yep, but I don't think compile-time cost and code size increases are much to worry about and are not the original motivation for the cutoff; too many runtime checks only prevents vectorization of 1% of otherwise vectorized loops in SPEC2006/SPEC2017/MultiSource with -O3. And when optimizing for size we currently do not allow runtime checks anyways.

Well okay then :)
So I guess what I need to do is rebase this patch on top of D109368, and simply methodically exterminate! exterminate! the compile-time limits instead of redesigning them, correct?

For LV, the threshold should already have been removed, but it doesn't remove RuntimeMemoryCheckThreshold, which is the main difference to this patch AFAICT :) It would be great if you could verify it works as expected with RawSpeed.

There's another option that potentially allows us to actually side-step the issue of not knowing the trip count. Based on the formula used in the original version of D109368, we can compute the minimum trip-count required for the vector loop to be profitable. We can also compute a minimum trip count so that the cost of the runtime-check is only a fraction of the total scalar loop cost. We already emit a minimum iteration check which can be adjusted with the additional computed minimums. I think that would allow us to vectorize a lot more aggressively, while still guarding against runtime checks adding a large overhead if they fail for low trip count loops. I updated D109368 accordingly.

Ok, please check if i got this right: instead of having a hard compile-time cut-off for the checks, which we believe is used to guard against compile-time (and file-size) explosion,
we completely drop this limit, always vectorize, but before doing the run-time checks, we perform the trip-count checks, and if it fails, we fallback to scalar loop?

Yes, that should be it basically (we also skip vectorization *if* we already know that the expected trip count is less than the computed minimums; this should also include profile info). This approach was the result of an offline discussion with @Ayal.

This sounds truly awesome.

That way we incur compile-time cost, filesize bloat, but not run-time cost.

Yep, but I don't think compile-time cost and code size increases are much to worry about and are not the original motivation for the cutoff; too many runtime checks only prevents vectorization of 1% of otherwise vectorized loops in SPEC2006/SPEC2017/MultiSource with -O3. And when optimizing for size we currently do not allow runtime checks anyways.

Well okay then :)
So I guess what I need to do is rebase this patch on top of D109368, and simply methodically exterminate! exterminate! the compile-time limits instead of redesigning them, correct?

For LV, the threshold should have been already be removed, but it doesn't move RuntimeMemoryCheckThreshold, which is a main difference to this patch AFAICT :) It would be great if you could verify it works as expected with rawspeed.

I've verified, and I can happily confirm that D109368 by itself vectorizes the problematic loops in question,
rendering this patch effectively obsolete. I will abandon it once D109368 lands.
@fhahn thank you!