
[LV] Don't vectorize if we can prove RT + vector cost >= scalar cost.
AcceptedPublic

Authored by fhahn on Sep 7 2021, 8:35 AM.

Details

Summary

If we can prove that the cost of the runtime checks + the total vector
loop cost exceed the total scalar cost, vectorization with runtime
checks is not profitable.

This is a first step towards guarding against regressions in cases where
we already know runtime checks are unprofitable, as the heuristics get
tweaked.
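
In the notation used later in this review (RtC = cost of the runtime checks, ScalarC and VecC = per-iteration cost of the scalar and vector loop, TC = trip count, VF = vectorization factor), the condition being guarded against is roughly:

  RtC + VecC * (TC / VF) >= ScalarC * TC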

Diff Detail

Event Timeline

fhahn created this revision.Sep 7 2021, 8:35 AM
fhahn requested review of this revision.Sep 7 2021, 8:35 AM
Herald added a project: Restricted Project.Sep 7 2021, 8:35 AM

Hi Florian,

I do think this change is a very important move in the right direction. Thanks for driving it. My main concern is essentially the same as for D75981. I think such things should be done directly in the CostModel. I've uploaded an alternative implementation which does the same thing as this patch but in the CostModel. Please take a look and let me know whether you like it or not. It would be really helpful to hear from others as well.

(to make it more obvious, the alternative patch is https://reviews.llvm.org/D109444)

I realized that doing a similar thing in the cost model requires a bit of preparatory work (about 6 patches). Due to that, I think it may be reasonable to land this first. Please find my comments inlined. One thing I would like to ask you: in order to simplify merging with the mentioned 6 patches, it would be nice if you rebased your work on D109443 (hopefully it can be landed quickly) and took D109444 (except line 10182) as part of this change.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8198

Would it be better to use !SelectedVF.Width.isScalar() instead? That would make it obvious that RTCost should be applied to vector loops only.
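
A minimal sketch of the suggested guard, assuming a SelectedVF of type VectorizationFactor in the surrounding context (the cost-accounting call is a hypothetical placeholder):

  // Runtime-check cost only matters when we actually emit a vector loop.
  if (!SelectedVF.Width.isScalar())
    accountForRuntimeCheckCost(SelectedVF);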

8199

In general, I would like us to agree on the strategy to follow when the number of iterations is not known at compile time.
Currently, I see the following inconsistency in the use of getSmallBestKnownTC. On the one hand, it may return an upper-bound estimate (SE.getSmallConstantMaxTripCount) which may greatly overestimate the real trip count. On the other hand, it returns None if SE.getSmallConstantMaxTripCount didn't manage to get an estimate, and we will end up skipping the check entirely.
I see 3 possible ways to follow in the case of an unknown trip count:

  1. Don't check RT overhead if the trip count is unknown at compile time. This is the most conservative solution (see the sketch after this list).
  2. Assume the maximum possible number of iterations. In practice that will essentially give the same result as 1) (because RTCost / double(*ExpectedTC) would be less than 1). This is essentially what this patch does. In addition we would need to take the max value for None as well.
  3. Assume some fixed, reasonable average number of iterations (may be extended to a more complex heuristic in the future).

While in theory 3) is more general and may give a better estimate, it may require painful tuning. Given that, I would vote for 1) at this moment as the most conservative.

Note also that regardless of the chosen way to go, we should decide how to categorize a trip count deduced from profile data. I personally think we should consider it the same way as a known trip count.
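
A minimal sketch of option 1, the most conservative variant (getSmallBestKnownTC is the existing helper in LoopVectorize.cpp; the surrounding control flow and isRTCheckOverheadProfitable are illustrative assumptions, not part of the patch):

  // Only apply the RT-overhead check when some trip count estimate exists;
  // with a fully unknown trip count, keep the current behaviour.
  if (Optional<unsigned> ExpectedTC = getSmallBestKnownTC(SE, TheLoop))
    if (!isRTCheckOverheadProfitable(RTCost, *ExpectedTC))
      return false; // Vectorization with runtime checks deemed unprofitable.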

8202

In the general case, the total vector cost is "PrologCost + VectorCost * (ExpectedTC / Width) + EpilogCost", where (ExpectedTC / Width) is an integer division. RTCost is just a part of the prolog cost. I think that is worth mentioning. Then we should explain why/how it reduces to "RTCost + VectorCost * (ExpectedTC / Width)". Thus we end up with the check "ScalarCost * ExpectedTC <= RTCost + VectorCost * (ExpectedTC / Width)". Division by ExpectedTC (using FP) makes sense to me and should give acceptable accuracy. Multiplication by 'Width' may give up to a 'Width - 1' error. For that reason I would avoid doing the multiplication by 'Width'. Thus we end up with "ScalarCost <= RTCost / (double)ExpectedTC + VectorCost / Width".

There are still two cases not taken into account (foldTailByMasking() and requiresScalarEpilogue()) but I think that's OK for this type of check.

Does that make sense?
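
A short sketch of the final per-iteration comparison, assuming the InstructionCost values have already been converted to double (all variable names are illustrative):

  // Amortize the one-off runtime-check cost over the expected trip count,
  // and the vector body cost over the vectorization factor.
  double RTCostPerIter  = RTCost / double(*ExpectedTC);
  double VecCostPerIter = VectorCost / double(Width);
  // Runtime checks are unprofitable if the scalar loop is at least as cheap.
  bool Unprofitable = ScalarCost <= RTCostPerIter + VecCostPerIter;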

8211

I would suggest keeping the result in a variable to avoid copying the (possibly non-trivial) formula.

fhahn updated this revision to Diff 375060.Sat, Sep 25, 11:54 AM

Hi Florian,

I do think this change is a very important move in the right direction. Thanks for driving it. My main concern is essentially the same as for D75981. I think such things should be done directly in the CostModel. I've uploaded an alternative implementation which does the same thing as this patch but in the CostModel. Please take a look and let me know whether you like it or not. It would be really helpful to hear from others as well.

Thanks for taking a look!

I made a largish adjustment to the patch. It still uses the formula outlined in the original patch, but uses it differently: instead of computing the costs for known/expected trip counts, it now computes the minimum trip count given the costs. This new minimum trip count can then be checked against the known/expected trip count. If there is no known/expected trip count, this minimum trip count is used for the minimum-iteration check.

It also computes a second minimum to have a bound on the runtime overhead (if the checks fail) compared to the scalar loop. This should guard against failing runtime checks increasing the total runtime by more than a fraction of the scalar loop. (At the moment the fraction is hardcoded to 1/10th of the total scalar loop cost, but I'll add an option for it if we converge.)
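
A sketch of how the two minimums can be derived from the costs, solving ScalarC * TC > RtC + VecC * (TC / VF) for TC and bounding the failing-check overhead to 1/10th of the scalar loop cost (names are illustrative; the actual patch may differ in details such as rounding):

  // MinTC1: smallest TC for which vector loop + RT checks beat the scalar
  // loop, i.e. TC * (ScalarC - VecC / VF) > RtC; assumes ScalarC > VecC / VF.
  double MinTC1 = RtC / (ScalarC - VecC / VF);
  // MinTC2: bound the overhead of failing RT checks to 10% of the total
  // scalar loop cost, i.e. RtC <= (ScalarC * TC) / 10.
  double MinTC2 = (RtC * 10) / ScalarC;
  // Needs <algorithm> and <cmath> for std::max/std::ceil.
  uint64_t MinTC = uint64_t(std::ceil(std::max(MinTC1, MinTC2)));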

I realized that doing a similar thing in the cost model requires a bit of preparatory work (about 6 patches). Due to that, I think it may be reasonable to land this first. Please find my comments inlined. One thing I would like to ask you: in order to simplify merging with the mentioned 6 patches, it would be nice if you rebased your work on D109443 (hopefully it can be landed quickly) and took D109444 (except line 10182) as part of this change.

Given the large update, I did not rebase it on D109443 yet. I'd suggest discussing moving the code into the cost model separately in your patches. I think at the moment the cost model is mostly concerned with picking the best vectorization factor and focuses on computing costs for different VFs. I am not sure yet what the trade-offs are of adding more VF-independent cost modeling there (drawbacks are adding even more global state to the cost model, increasing complexity). I assume the motivation for moving it to the cost model is to allow using it when deciding on whether to vectorize given ScalarEpilogueStatus? I think that's best discussed separately with focus on that case.

Matt added a subscriber: Matt.Sat, Sep 25, 12:34 PM

FYI this patch can't be applied with arc easily:
error: llvm/test/Transforms/LoopVectorize/X86/pointer-runtime-checks-unprofitable.ll: does not exist in index

lebedev.ri accepted this revision.Sat, Sep 25, 3:32 PM

I have checked, and this is an even more intrusive solution than the originally-discussed design (D109296),
and it very successfully vectorizes the loops in question!

I must say, I really like this approach, since it is the only way to truly cover all the cases.
I basically had in mind something along the lines of "do we even need to restrict the run-time checks",
but I did not propose this as I thought it would be *too* radical.

I've gone over the math (both by hand and via sage), and it checks out.
As far as I'm concerned, this Looks Great.

@fhahn @Ayal thank you so much!

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
190

Nit: what does prof mean? PGO profile? Profitable?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3303–3308

This looks like std::max(step, MinProfTripCount)?
Might be worth adding a comment.

8219
8221–8222

I'm rather sure we can't :)

8224–8229

I agree that the math here checks out; the sign is correct.

8227

This is not VecC, it's VecC/VF.
Not an error, but confusing.

8239–8243

So does this round up or down?
Use alignTo()/alignDown()?
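
For reference, llvm::alignTo and llvm::alignDown (llvm/Support/MathExtras.h) make the rounding direction explicit; a hypothetical use with the names from this discussion:

  #include "llvm/Support/MathExtras.h"
  // alignTo rounds up to a multiple of the second argument, alignDown
  // rounds down: alignTo(10, 4) == 12, alignDown(10, 4) == 8.
  uint64_t Rounded = llvm::alignTo(MinTC, VF.getKnownMinValue());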

This revision is now accepted and ready to land.Sat, Sep 25, 3:32 PM

Also, this patch should probably steal its subject from D109296; the current one does not do it justice.

ebrevnov added inline comments.Sun, Sep 26, 11:19 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
449

Since the 'minimum profitable trip count' is part of VectorizationFactor, and InnerLoopVectorizer should know both VecWidth and MinProfTripCount, I would suggest passing that information as a single VectorizationFactor argument.

8222

The expression 'VecC imul (TC idiv VF)' involves an integer division (idiv), so it is not legal to simply replace it with 'TC fmul (VecC fdiv VF)'. The maximum possible error is up to 'VF'. In other words, 'TC fmul (VecC fdiv VF)' - 'VecC imul (TC idiv VF)' <= VF. That means 'MinTC1' computed that way is an upper estimate of the actual minimum. I don't see anything terribly bad in taking an upper estimate, but IMHO it is worth mentioning in the comments.

8245

Please use alignTo() instead.

This sounds like a sensible general approach to take to me, from a cost point of view. It seems similar to how I once heard the GCC vectorizer cost calculation described.

But there may be more work required on the costing calculations. It might be both under- and over-estimating the costs in places.
The first obvious problem I ran into was that the runtime checks it emits are not simplified prior to the costs being added for each instruction. The overflow checks seem to create 'umul_with_overflow(|step|, TC)'. But the step is commonly 1, so given half a chance the instruction will be simplified away.
It looks like there were other inefficiencies with the values created too, with 'select false, x, y' or 'or x, false' being added to the cost. But the 'umul_with_overflow(1, TC)' was getting a pretty high cost.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
449

Can this have a default value, to prevent the need for multiple constructors?

8208

This should be RtC + VecC * floor(TC / VF) + EpiC or RtC + VecC * ceil(TC / VF) when folding the tail (assuming there is no epilogue then). That makes the math more difficult unless we assume that those round to the same thing.

It is also assuming that the runtime checks do not fail, otherwise a probability factor would need to be added. (Probably fine to assume, but the more runtime checks there are, the less likely they are to all succeed).

And there can be other costs for vectorizations. The "check N is less than MinProfTripCount" isn't free and there can be other inefficiencies from vector loops.
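
Spelled out, the two total-cost variants described above are:

  scalar epilogue: RtC + VecC * floor(TC / VF) + EpiC
  folded tail:     RtC + VecC * ceil(TC / VF)   (no epilogue)

versus a total scalar cost of ScalarC * TC. The patch's RtC + VecC * (TC / VF) treats the division as exact, which is where the rounding discussion elsewhere in this review comes in.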

8224

Perhaps add some more assumptions, like the ones mentioned above and that (ScalarC - (VecC / VF)) > 0 from other profitability checks.

Given the large update, I did not rebase it on D109443 yet. I'd suggest discussing moving the code into the cost model separately in your patches.

Sure, let's discuss the move to the cost model separately. Having said that, the first four patches in the series (including D109443) are general improvements independent of the cost-model change itself.
Of course, there is no strong dependency, and D109443 should not block this one.

ebrevnov added inline comments.Mon, Sep 27, 12:33 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8208

This should be RtC + VecC * floor(TC / VF) + EpiC or RtC + VecC * ceil(TC / VF) when folding the tail (assuming there is no epilogue then).

Since we already compute an upper estimate for MinTC1, it doesn't seem necessary to do an additional adjustment when folding the tail. But we probably want/need to do an adjustment for the 'requiresScalarEpilogue' case.

8224

It's totally fine for '(ScalarC - (VecC / VF))' to be negative, since we take max(MinTC1, MinTC2) and MinTC2 >= 0.

ebrevnov accepted this revision.Mon, Sep 27, 12:38 AM

Generally LGTM (except a few nits mentioned above).

nikic added a subscriber: nikic.Mon, Sep 27, 12:59 AM

Can you please provide an analysis of the code size impact this has? LLVM is already quite bad at unrolling/vectorizing too aggressively for non-benchmark code (which does not have hot loops) and I'm somewhat concerned this will make the situation even worse. Please do keep in mind that code size has important performance effects on icache and itlb misses, and just jumping over the code doesn't make that cost disappear.

I am not sure yet what the trade-offs are of adding more VF-independent cost modeling there (drawbacks are adding even more global state to the cost model, increasing complexity).

Right, while the cost of runtime checks is (at the moment) VF-independent, we could decide not to do it inside the cost model. Though conceptually it looks like a much more solid design when the CostModel is responsible for the cost-based decisions. And the total code complexity will be even lower than doing cost modeling in different places.

I assume the motivation for moving it to the cost model is to allow using it when deciding on whether to vectorize given ScalarEpilogueStatus?

Right, the cost of the epilogue depends on TC & VF. No excuse to do it outside the cost model :-)

I think that's best discussed separately with focus on that case.

Sure, we can do it later. But we will have to restructure first (essentially move the existing code inside the CM), otherwise there will be nasty code duplication.

fhahn updated this revision to Diff 375681.Tue, Sep 28, 1:21 PM
fhahn marked 17 inline comments as done.

Updated to address the inline comments. Thanks everyone! I hope I didn't miss any (apologies if I did).

Also, this patch should probably steal its subject from D109296; the current one does not do it justice.

Yep, I need to update the title & description! I'll do that once I've collected a bit more supporting data.

This sounds like a sensible general approach to take to me, from a cost point of view. It seems similar to how I once heard the GCC vectorizer cost calculation described.

But there may be more work required on the costing calculations. It might be both under- and over-estimating the costs in places.
The first obvious problem I ran into was that the runtime checks it emits are not simplified prior to the costs being added for each instruction. The overflow checks seem to create 'umul_with_overflow(|step|, TC)'. But the step is commonly 1, so given half a chance the instruction will be simplified away.
It looks like there were other inefficiencies with the values created too, with 'select false, x, y' or 'or x, false' being added to the cost. But the 'umul_with_overflow(1, TC)' was getting a pretty high cost.

Thanks for raising this point! As you mentioned, we should try to make sure the code we emit for the runtime checks is as close to the final version as possible. The 'or x, false' case should be easy to fix. For the 'umul_with_overflow' I'd need to take a closer look to see where the best place to improve that would be.

Can you please provide an analysis of the code size impact this has? LLVM is already quite bad at unrolling/vectorizing too aggressively for non-benchmark code (which does not have hot loops) and I'm somewhat concerned this will make the situation even worse. Please do keep in mind that code size has important performance effects on icache and itlb misses, and just jumping over the code doesn't make that cost disappear.

I am still collecting the relevant data and I'll share it here once I have more details. On a high level, the impact on the number of vectorized loops is relatively small on a large set of benchmarks (SPEC2006/SPEC2017/MultiSource), with a ~1% increase in vectorized loops on ARM64 with -O3.

So far the top size increases seem to be in smaller benchmarks. I need to take a closer look at IRSmk and smg2000 in particular. A first inspection of IRSmk shows that it mostly consists of 2-3 large loops for which we did not generate runtime checks so far, but do with this patch. I still need to check why the increase is so big.

Benchmark                                     Size before  Size after  Increase
test-suite...s/ASC_Sequoia/IRSmk/IRSmk.test   3240.00    26668.00   723.1%
test-suite...CI_Purple/SMG2000/smg2000.test   154724.00  403628.00  160.9%
test-suite...chmarks/Rodinia/srad/srad.test   4332.00    5736.00    32.4%
test-suite...Source/Benchmarks/sim/sim.test   13628.00   16660.00   22.2%
test-suite...pplications/oggenc/oggenc.test   202936.00  232608.00  14.6%
test-suite...oxyApps-C/miniGMG/miniGMG.test   51588.00   57792.00   12.0%
test-suite...arks/mafft/pairlocalalign.test   356532.00  397024.00  11.4%
test-suite...pps-C/SimpleMOC/SimpleMOC.test   30488.00   33436.00    9.7%
test-suite...oxyApps-C/miniAMR/miniAMR.test   51744.00   55580.00    7.4%
test-suite...rks/FreeBench/pifft/pifft.test   57368.00   59032.00    2.9%
test-suite...8.imagick_r/538.imagick_r.test   1484356.00 1518020.00  2.3%
test-suite...yApps-C++/PENNANT/PENNANT.test   105064.00  107004.00   1.8%
test-suite...rks/tramp3d-v4/tramp3d-v4.test   776340.00  788504.00   1.6%
test-suite...6/464.h264ref/464.h264ref.test   614848.00  623200.00   1.4%
llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
190

Changed to MinProfitableTripCount and added a comment.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
449

Can this have a default value, to prevent the need for multiple constructors?

unfortunately that's not possible with the currently available constructors. But I made it a required argument and updated the callers to avoid the extra constructor.

Since the 'minimum profitable trip count' is part of VectorizationFactor, and InnerLoopVectorizer should know both VecWidth and MinProfTripCount, I would suggest passing that information as a single VectorizationFactor argument.

That would be good, but unfortunately I think the epilogue vectorizer instantiation only has access to an ElementCount for now :( It can be threaded through as a follow-up.

8208

This should be RtC + VecC * floor(TC / VF) + EpiC or RtC + VecC * ceil(TC / VF) when folding the tail (assuming there is no epilogue then). That makes the math more difficult unless we assume that those round to the same thing.

I added a statement below about the fact that the computations are performed on doubles and later rounded up, giving an upper-bound estimate as @ebrevnov suggested. Do you think that's sufficient?

It is also assuming that the runtime checks do not fail, otherwise a probability factor would need to be added. (Probably fine to assume, but the more runtime checks there are, the less likely they are to all succeed).

Yep, that's a fundamental assumption at the moment. Unfortunately I cannot think of a good way to estimate the probability of the checks passing. If we assigned a fixed probability per runtime check, we would likely end up with a hard limit like we have at the moment, just expressed differently.

The main motivation of MinTC2 below is to introduce a limit on the impact of a (large) number of runtime checks. The main goal is to avoid evaluating large runtime checks for short-running loops (at the moment we allow increasing total runtime by 10% due to failing runtime checks, but this could also be lower).
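
Concretely, requiring that the one-off runtime-check cost stays within a tenth of the total scalar loop cost gives the second minimum:

  RtC <= (ScalarC * TC) / 10   =>   TC >= (RtC * 10) / ScalarC = MinTC2

(with 1/10th being the currently hardcoded fraction mentioned above).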

While there might be additional cases where failing runtime checks cause an increase in runtime, the same problem already exists even with the hard-coded limit we have at the moment.

We could also change the way we emit runtime checks slightly and break them up across multiple blocks with earlier exits, to increase the chances that we do not have to evaluate all runtime checks if some fail.

And there can be other costs for vectorizations. The "check N is less than MinProfTripCount" isn't free and there can be other inefficiencies from vector loops.

Agreed, this can be an unfortunate side effect. Again, this is a problem we are already hitting, and this patch will add a few more vectorized loops. But I think in general the impact on the number of loops vectorized with this patch should be relatively small (for SPEC2006/SPEC2017/MultiSource, ~1% more loops are vectorized). And I think unfortunately there's not much we can do to avoid this check in general.

One follow-up that I think becomes important is to make sure we use PGO to detect cases where we create dead vector loops and skip vectorizing them.

8222

Thanks, I tried to update the comment to make this clearer.

8224

I think it should not really matter, as @ebrevnov said, but cases where it is negative may be interesting for checking the cost modeling.

8227

Changed to VecCOverVF.

8239–8243

Ah, that was the one I was looking for! Updated, thanks!

8245

Updated, thanks!

lebedev.ri accepted this revision.Tue, Sep 28, 1:41 PM

Thank you, still looks great!

dmgreen added inline comments.Wed, Sep 29, 1:49 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
449

unfortunately that's not possible with the currently available constructors. But I made it a required argument and updated the callers to avoid the extra constructor.

I was expecting it to use = ElementCount(), but this sounds OK too. Is the InnerLoopUnroller value deliberately 1, or would passing zero be better? I imagine it doesn't make much difference in practice.

8208

As far as I understand (correct me if I'm wrong!) we are essentially changing from code that looked like:

if (N < VF)
  goto scalar loop
if (!runtimechecks)
  goto scalar loop
vector loop; N -= VF per iteration
scalar loop

To the same code, but with a different initial guard value and potentially more runtime checks in places:

if (N < MinProfitableTripCount)
  goto scalar loop
if (!runtimechecks)
  goto scalar loop
vector loop; N -= VF per iteration
scalar loop

That means that if we under-estimate MinProfitableTripCount, we go into the runtime checks/vector loop, potentially executing a lot of expensive runtime checks where it is not profitable.
If we _over_ estimate the MinProfitableTripCount, then at runtime we will not execute the vector code, falling back to the scalar loop. So we have generated larger/less efficient code whose vector part never executes, even when it would be profitable to do so.

So we end up in the unfortunate place where either over- or under-estimating the cost can lead to inefficiencies.

I'm not too worried about the details here. They sound fine for the most part, so long as they are close enough. I'm more worried about the cost of the runtime checks being over-estimated due to them being unsimplified prior to costing. I think that is where the worst regressions I am seeing from this patch come from. Loops where vector code was previously generated and executed are now skipped over. Unfortunately, loops with lowish trip counts are common in a lot of code :)

The code in LoopVectorizationCostModel::isMoreProfitable already talks about the cost in terms of PerIterationCost*ceil(TripCount/VF) vs PerIterationCost*floor(TC/VF) though, and I would recommend describing things in the same way here, explaining that RtC + VecC * (TC / VF) + EpiC is a simplification of that.

That would be good, but unfortunately I think the epilogue vectorizer instantiation only has access to an ElementCount for now :( It can be threaded through as a follow-up.

There is no cost associated with runtime checks for epilogue vectorization. Thus we should simply initialize 'MinProfitableTripCount' with an 'unset' value in the Epilogue Vectorizer.

ebrevnov added inline comments.Fri, Oct 1, 12:04 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
460

IMHO we had better keep the semantics of 'MinProfitableTripCount' and not change its value to something not coming from profitability considerations. We can easily do the required adjustments at the use site(s).