This is an archive of the discontinued LLVM Phabricator instance.

[LV] Vectorize cases with larger number of RT checks, execute only if profitable.
ClosedPublic

Authored by fhahn on Sep 7 2021, 8:35 AM.

Details

Summary

This patch replaces the tight hard cut-off for the number of runtime
checks with a more accurate cost-driven approach.

The new approach allows vectorization with a larger number of runtime
checks in general, but only executes the vector loop (and runtime checks) if
considered profitable at runtime. Profitable here means that the cost-model
indicates that the runtime check cost + vector loop cost < scalar loop cost.

To do that, LV computes the minimum trip count for which runtime check cost
+ vector-loop-cost < scalar loop cost.
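
For illustration (with made-up costs, not numbers from the patch): assuming a scalar cost of 8 per iteration, a vector cost of 20 per vector iteration at VF = 4 (i.e. 5 per scalar iteration) and a one-off runtime-check cost of 30, the break-even point is 30 / (8 - 5) = 10 iterations, so the runtime checks and vector loop would only be entered for trip counts of at least 10.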

Note that there is still a hard cut-off to avoid excessive compile-time/code-size
increases, but it is much larger than the original limit.

The performance impact on standard test-suites like SPEC2006/SPEC2017/MultiSource
is mostly neutral, but the new approach can give substantial gains in cases where
we failed to vectorize before due to the over-aggressive cut-offs.

On AArch64 with -O3, I didn't observe any regressions outside the noise level (<0.4%)
and there are the following execution time improvements. Both IRSmk and srad are relatively short running, but the changes are far above the noise level for them on my benchmark system.

CFP2006/447.dealII/447.dealII    -1.9%
CINT2017rate/525.x264_r/525.x264_r    -2.2%
ASC_Sequoia/IRSmk/IRSmk       -9.2%
Rodinia/srad/srad     -36.1%

Size regressions on AArch64 with -O3 are:

MultiSource/Applications/hbd/hbd                 90256.00   106768.00 18.3%
MultiSourc...ks/ASCI_Purple/SMG2000/smg2000     240676.00   257268.00  6.9%
MultiSourc...enchmarks/mafft/pairlocalalign     472603.00   489131.00  3.5%
External/S...2017rate/525.x264_r/525.x264_r     613831.00   630343.00  2.7%
External/S...NT2006/464.h264ref/464.h264ref     818920.00   835448.00  2.0%
External/S...te/538.imagick_r/538.imagick_r    1994730.00  2027754.00  1.7%
MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4    1236471.00  1253015.00  1.3%
MultiSource/Applications/oggenc/oggenc         2108147.00  2124675.00  0.8%
External/S.../CFP2006/447.dealII/447.dealII    4742999.00  4759559.00  0.3%
External/S...rate/510.parest_r/510.parest_r   14206377.00 14239433.00  0.2%

Diff Detail

Event Timeline

fhahn created this revision.Sep 7 2021, 8:35 AM
fhahn requested review of this revision.Sep 7 2021, 8:35 AM
Herald added a project: Restricted Project. Sep 7 2021, 8:35 AM

Hi Florian,

I do think this change is a very important move in the right direction. Thanks for driving it. My main concern is essentially the same as for D75981. I think such things should be done directly in the CostModel. I've uploaded an alternative implementation which does the same thing as this patch but in the CostModel. Please take a look and let me know if you like that or not. It would be really helpful to hear from others as well.

(to make it more obvious, the alternative patch is https://reviews.llvm.org/D109444)

I realized that doing a similar thing in the cost model requires a bit of preparatory work (about 6 patches). Due to that, I think it may be reasonable to land this first. Please find my comments inlined. One thing I would like to ask you: in order to simplify merging with the mentioned 6 patches, it would be nice if you rebased your work on D109443 (hopefully it can be landed quickly) and took D109444 (except line 10182) as part of this change.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7508

Would it be better to use !SelectedVF.Width.isScalar() instead? That would make it obvious that RTCost should be applied to vector loops only.

7509

In general, I would like us to agree on the strategy to follow when the number of iterations is not known at compile time.
Currently, I see the following inconsistency in the use of getSmallBestKnownTC. On the one hand, it may return an upper bound estimate (SE.getSmallConstantMaxTripCount) which may greatly overestimate the real trip count. On the other hand, it returns None if SE.getSmallConstantMaxTripCount didn't manage to get an estimate, and we will end up skipping the check entirely.
I see 3 possible ways to follow in the case of an unknown trip count:

  1. Don't check RT overhead if the trip count is unknown at compile time. This is the most conservative solution.
  2. Assume the maximum possible number of iterations. In practice that will essentially give the same result as 1) (because RTCost / double(*ExpectedTC) would be less than 1). This is essentially what this patch does. In addition we would need to take the max value for None as well.
  3. Assume some fixed, reasonable average number of iterations (may be extended to a more complex heuristic in the future).

While in theory 3) is more general and may give a better estimate, it may require painful tuning. Given that, I would vote for 1) at this moment as the most conservative.

Note also that regardless of the chosen way to go, we should decide how to categorize a trip count deduced from a profile. I personally think we should treat it the same way as a known trip count.

7512

In the general case, the total vector cost is "PrologCost + VectorCost*(ExpectedTC/Width) + EpilogCost", where (ExpectedTC/Width) is an integer division. RTCost is just a part of the prolog cost; I think that is worth mentioning. Then we should explain why/how it reduces to "RTCost + VectorCost*(ExpectedTC/Width)". Thus we end up with the check "ScalarCost * ExpectedTC <= RTCost + VectorCost*(ExpectedTC/Width)". Division by ExpectedTC (using FP) makes sense to me and should give acceptable accuracy. Multiplication by 'Width' may give up to a 'Width - 1' error. For that reason I would avoid doing the multiplication by 'Width'. Thus we end up with "ScalarCost <= RTCost/(double)ExpectedTC + VectorCost/Width".

There are still two cases not taken into account (foldTailByMasking() and requiresScalarEpilogue()), but I think it's OK for this type of check.

Does that make sense?
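
For reference, here is a minimal sketch of the check described above, with illustrative names and all costs evaluated as doubles (this is not the patch's code):

#include <cstdint>

// Vectorization (including the runtime checks) is considered unprofitable when
// ScalarCost <= RTCost / ExpectedTC + VectorCost / Width.
static bool runtimeChecksProfitable(double ScalarCost, double VectorCost,
                                    double RTCost, unsigned Width,
                                    uint64_t ExpectedTC) {
  return ScalarCost > RTCost / double(ExpectedTC) + VectorCost / double(Width);
}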

7521

I would suggest keeping the result in a variable to avoid copying the (possibly non-trivial) formula.

fhahn updated this revision to Diff 375060.Sep 25 2021, 11:54 AM

Hi Florian,

I do think this change is a very important move in the right direction. Thanks for driving it. My main concern is essentially the same as for D75981. I think such things should be done directly in the CostModel. I've uploaded an alternative implementation which does the same thing as this patch but in the CostModel. Please take a look and let me know if you like that or not. It would be really helpful to hear from others as well.

Thanks for taking a look!

I made a largish adjustment to the patch. It still uses the formula outlined in the original patch, but uses it differently: instead of computing the costs for known/expected trip counts, it now computes the minimum trip count given the costs. This new minimum trip count can then be checked against the known/expected trip count. If there is no known/expected trip count, this minimum trip count is used for the minimum iteration check.

It also computes a second minimum to put a bound on the runtime overhead (if the checks fail) compared to the scalar loop. This should guard against failing runtime checks increasing the total runtime by more than a fraction of the scalar loop. (At the moment the fraction is hardcoded to 1/10th of the total scalar loop cost, but I'll add an option for it if we converge.)
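
As a rough sketch of the two minimums described above (illustrative names, the hardcoded 1/10 fraction, and assuming ScalarC > VecC / VF; this is not the patch's actual code):

#include <algorithm>
#include <cmath>
#include <cstdint>

// ScalarC: cost per scalar iteration, VecC: cost per vector iteration,
// RtC: one-off cost of the runtime checks, VF: the (fixed) vectorization factor.
uint64_t minProfitableTripCount(double ScalarC, double VecC, double RtC, unsigned VF) {
  // MinTC1: smallest TC for which RtC + (VecC / VF) * TC < ScalarC * TC.
  double MinTC1 = RtC / (ScalarC - VecC / VF);
  // MinTC2: keep the cost of failing checks below 1/10th of the scalar loop cost,
  // i.e. RtC <= ScalarC * TC / 10.
  double MinTC2 = 10.0 * RtC / ScalarC;
  uint64_t MinTC = uint64_t(std::ceil(std::max(MinTC1, MinTC2)));
  return (MinTC + VF - 1) / VF * VF; // round up to a multiple of VF
}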

I realized that doing a similar thing in the cost model requires a bit of preparatory work (about 6 patches). Due to that, I think it may be reasonable to land this first. Please find my comments inlined. One thing I would like to ask you: in order to simplify merging with the mentioned 6 patches, it would be nice if you rebased your work on D109443 (hopefully it can be landed quickly) and took D109444 (except line 10182) as part of this change.

Given the large update, I have not rebased it on D109443 yet. I'd suggest discussing moving the code into the cost model separately in your patches. I think at the moment the cost model is mostly concerned with picking the best vectorization factor and focuses on computing costs for different VFs. I am not sure yet what the trade-offs of adding more VF-independent cost modeling there are (drawbacks are adding even more global state to the cost model and increasing complexity). I assume the motivation for moving it to the cost model is to allow using it when deciding whether to vectorize given ScalarEpilogueStatus? I think that's best discussed separately, with a focus on that case.

Matt added a subscriber: Matt.Sep 25 2021, 12:34 PM

FYI this patch can't be applied with arc easily:
error: llvm/test/Transforms/LoopVectorize/X86/pointer-runtime-checks-unprofitable.ll: does not exist in index

lebedev.ri accepted this revision.Sep 25 2021, 3:32 PM

I have checked, and this is an even more intrusive solution than the originally-discussed design (D109296),
and it very successfully vectorizes the loops in question!

I must say, I really like this approach, since it is the only way to truly cover all the cases.
I basically had in mind something along the lines of "do we even need to restrict the run-time checks",
but I did not propose this as I thought it would be *too* radical.

I've gone over the math (both by hand and via sage), and it checks out.
As far as I'm concerned, this Looks Great.

@fhahn @Ayal thank you so much!

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
193

Nit: what does prof mean? PGO profile? Profitable?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2943–2948

This looks like std::max(step, MinProfTripCount)?
Might be worth adding a comment.

7528
7530–7531

I'm rather sure we can't :)

7533–7538

I agree that here the math checks out, the sign is correct.

7536

This is not VecC, it's VecC/VF.
Not an error, but confusing.

7548–7552

So does this round up or down?
Use alignTo()/alignDown()?

This revision is now accepted and ready to land.Sep 25 2021, 3:32 PM

Also, this patch should probably steal its subject from D109296; the current one does not do it justice.

ebrevnov added inline comments.Sep 26 2021, 11:19 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
447

Since 'minimum profitable trip count' is part of VectorizationFactor and InnerLoopVectorizer should know both VecWidth and MinProfTripCount, I would suggest passing that information as a single VectorizationFactor argument.

7531

The expression 'VecC imul (TC idiv VF)' is not associative since it involves an integer division (idiv). Thus it's not legal to simply replace it with 'TC fmul (VecC fdiv VF)'. The maximum possible error is up to 'VF'. In other words, 'TC fmul (VecC fdiv VF)' - 'VecC imul (TC idiv VF)' <= VF. That means 'MinTC1' computed that way is an upper estimate of the actual minimum. I don't see anything terribly bad in taking an upper estimate, but IMHO it is worth mentioning in the comments.

7554

Please use alignTo() instead.
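
For what it's worth, a minimal illustration of the suggestion (names are hypothetical, assuming a fixed-width VF):

#include "llvm/Support/MathExtras.h"
#include <cstdint>

// Round the raw trip-count estimate up to the next multiple of the vectorization factor.
uint64_t roundUpToVF(uint64_t RawMinTC, uint64_t VFKnownMin) {
  return llvm::alignTo(RawMinTC, VFKnownMin);
}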

This sounds like a sensible general approach to take to me, from a cost point of view. It seems similar to how I once heard the GCC vectorizer's cost calculation described.

But there may be more work required on the costing calculations. It might be both under- and over-estimating the costs in places.
The first obvious problem I ran into was that the runtime checks it emits are not simplified prior to the costs being added for each instruction. The overflow checks seem to create "umul_with_overflow(|step|, TC)". But the step is commonly 1, so given half a chance the instruction would be simplified away.
It looks like there were other inefficiencies in the values created too, with 'select false, x, y' or 'or x, false' being added to the cost. But the umul_with_overflow(1, TC) was getting a pretty high cost.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
447

Can this have a default value, to prevent the need for multiple constructors?

7517

This should be RtC + VecC * floor(TC / VF) + EpiC or RtC + VecC * ceil(TC / VF) when folding the tail (assuming there is no epilogue then). That makes the math more difficult unless it makes the assumption that those round to the same thing.

It is also assuming that the runtime checks do not fail, otherwise a probability factor would need to be added. (Probably fine to assume, but the more runtime checks there are, the less likely they are to all succeed).

And there can be other costs for vectorizations. The "check N is less than MinProfTripCount" isn't free and there can be other inefficiencies from vector loops.

7533

Perhaps add some more assumptions, like the ones mentioned above and that (ScalarC - (VecC / VF)) > 0 from other profitability checks.

Given the large update, I have not rebased it on D109443 yet. I'd suggest discussing moving the code into the cost model separately in your patches.

Sure, let's discuss the move to the cost model separately. Having said that, the first four patches in the series (including D109443) are general improvements independent of the change to the cost model itself.
Of course, there is no strong dependency and D109443 should not block this one.

ebrevnov added inline comments.Sep 27 2021, 12:33 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7517

This should be RtC + VecC * floor(TC / VF) + EpiC or RtC + VecC * ceil(TC / VF) when folding the tail (assuming there is no epilogue then).

Since we already compute an upper estimate for MinTC1, it doesn't seem necessary to do an additional adjustment when folding the tail. But we probably want/need to do an adjustment for the "requiresScalarEpilogue" case.

7533

It's totally fine for '(ScalarC - (VecC / VF))' to be negative since we take max(MinTC1, MinTC2) and MinTC2>=0

ebrevnov accepted this revision.Sep 27 2021, 12:38 AM

Generally LGTM (except for a few mentioned nits)

nikic added a subscriber: nikic.Sep 27 2021, 12:59 AM

Can you please provide an analysis of the code size impact this has? LLVM is already quite bad at unrolling/vectorizing too aggressively for non-benchmark code (which does not have hot loops) and I'm somewhat concerned this will make the situation even worse. Please do keep in mind that code size has important performance effects on icache and itlb misses, and just jumping over the code doesn't make that cost disappear.

I am not sure what the trade-offs are to adding more VF independent cost-modeling there yet (drawbacks are adding even more global state to the cost model, increasing complexity).

Right, while the cost of runtime checks is (at the moment) VF independent, we can decide not to do it inside the cost model. Though conceptually it looks like a much more solid design when the CostModel is responsible for the cost-based decisions. And the total code complexity will be even less than doing cost modeling in different places.

I assume the motivation for moving it to the cost model is to allow using it when deciding on whether to vectorize given ScalarEpilogueStatus?

Right, the cost of the epilog depends on TC & VF. No excuse to do it outside the cost model :-)

I think that's best discussed separate with focus on that case.

Sure, we can do it later. But we will have to do some restructuring first (essentially move the existing code inside the CM), otherwise there will be nasty code duplication.

fhahn updated this revision to Diff 375681.Sep 28 2021, 1:21 PM
fhahn marked 17 inline comments as done.

Updated to address the inline comments. Thanks everyone! I hope I didn't miss any (apologies if I did).

Also, this patch should probably steal its subject from D109296; the current one does not do it justice.

Yep, I need to update the title & description! I'll do that once I have collected a bit more supporting data.

This sounds like a sensible general approach to take to me, from a cost point of view. It seems similar to how I once heard the GCC vectorizer's cost calculation described.

But there may be more work required on the costing calculations. It might be both under- and over-estimating the costs in places.
The first obvious problem I ran into was that the runtime checks it emits are not simplified prior to the costs being added for each instruction. The overflow checks seem to create "umul_with_overflow(|step|, TC)". But the step is commonly 1, so given half a chance the instruction would be simplified away.
It looks like there were other inefficiencies in the values created too, with 'select false, x, y' or 'or x, false' being added to the cost. But the umul_with_overflow(1, TC) was getting a pretty high cost.

Thanks for raising this point! As you mentioned, we should try to make sure the code we emit for the runtime checks is as close to the final version as possible. The 'or x, false' case should be easy to fix. For the umul_with_overflow I'd need to take a closer look to see where the best place to improve that would be.

Can you please provide an analysis of the code size impact this has? LLVM is already quite bad at unrolling/vectorizing too aggressively for non-benchmark code (which does not have hot loops) and I'm somewhat concerned this will make the situation even worse. Please do keep in mind that code size has important performance effects on icache and itlb misses, and just jumping over the code doesn't make that cost disappear.

I am still collecting the relevant data and I'll share it here once I have more details. On a high level, the impact on the number of vectorized loops is relatively small on a large set of benchmarks (SPEC2006/SPEC2017/MultiSource), with a ~1% increase in vectorized loops on ARM64 with -O3.

So far the top size increases seem to be in smaller benchmarks. I need to take a closer look at IRSmk and smg2000 in particular. A first inspection of IRSmk shows that it mostly consists of 2-3 large loops for which we did not generate runtime checks before, but do with this patch. I still need to check why the increase is so big.

test-suite...s/ASC_Sequoia/IRSmk/IRSmk.test   3240.00    26668.00   723.1%
test-suite...CI_Purple/SMG2000/smg2000.test   154724.00  403628.00  160.9%
test-suite...chmarks/Rodinia/srad/srad.test   4332.00    5736.00    32.4%
test-suite...Source/Benchmarks/sim/sim.test   13628.00   16660.00   22.2%
test-suite...pplications/oggenc/oggenc.test   202936.00  232608.00  14.6%
test-suite...oxyApps-C/miniGMG/miniGMG.test   51588.00   57792.00   12.0%
test-suite...arks/mafft/pairlocalalign.test   356532.00  397024.00  11.4%
test-suite...pps-C/SimpleMOC/SimpleMOC.test   30488.00   33436.00    9.7%
test-suite...oxyApps-C/miniAMR/miniAMR.test   51744.00   55580.00    7.4%
test-suite...rks/FreeBench/pifft/pifft.test   57368.00   59032.00    2.9%
test-suite...8.imagick_r/538.imagick_r.test   1484356.00 1518020.00  2.3%
test-suite...yApps-C++/PENNANT/PENNANT.test   105064.00  107004.00   1.8%
test-suite...rks/tramp3d-v4/tramp3d-v4.test   776340.00  788504.00   1.6%
test-suite...6/464.h264ref/464.h264ref.test   614848.00  623200.00   1.4%
llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
193

Changed to MinProfitableTripCount and added a comment.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
447

Can this have a default value, to prevent the need for multiple constructors?

unfortunately that's not possible with the currently available constructors. But I made it a required argument and updated the callers to avoid the extra constructor.

Since 'minimum profitable trip count' is part of VectorizationFactor and InnerLoopVectorizer should know both VecWidth and MinProfTripCount, I would suggest passing that information as a single VectorizationFactor argument.

That would be good, but unfortunately I think the epilogue vectorizer instantiation only has access to an ElementCount for now :( This can be threaded through as a follow-up.

7517

This should be RtC + VecC * floor(TC / VF) + EpiC or RtC + VecC * ceil(TC / VF) when folding the tail (assuming there is no epilogue then). That makes the math more difficult unless it makes the assumption that those round to the same thing.

I added a statement below about the fact that the computations are performed on doubles and later rounded up, giving an upper bound estimate as @ebrevnov suggested. Do you think that's sufficient?

It is also assuming that the runtime checks do not fail, otherwise a probability factor would need to be added. (Probably fine to assume, but the more runtime checks there are, the less likely they are to all succeed).

Yep, that's a fundamental assumption at the moment. Unfortunately I cannot think of a good way to estimate the probability of the checks passing. If we assigned a fixed probability per runtime check, we would likely end up with a hard limit like we have at the moment, just expressed differently.

The main motivation of MinTC2 below is to introduce a limit on the impact of a (large) number of runtime checks. The main goal is to prevent evaluating large runtime checks for short-running loops (at the moment we allow increasing the total runtime by 10% due to failing runtime checks, but this could also be lower).

While there might be additional cases where failing runtime checks cause an increase in runtime, the same problem already exists even with the hard-coded limit we have at the moment.

We could also change the way we emit runtime checks slightly and break them up across multiple blocks with earlier exits, to increase the chances that we do not have to evaluate all runtime checks if some fail.

And there can be other costs for vectorizations. The "check N is less than MinProfTripCount" isn't free and there can be other inefficiencies from vector loops.

Agreed, this can be an unfortunate side effect. Again, this is a problem we are already hitting, and this patch will add a few more vectorized loops. But I think in general the impact on the number of loops vectorized with this patch should be relatively small (for SPEC2006/SPEC2017/MultiSource ~1% more loops are vectorized). And I think unfortunately there's not much we can do to avoid this check in general.

One follow-up that I think becomes important is to make sure that we try to use PGO to detect cases where we create dead vector loops and skip vectorizing them.

7531

Thanks, I tried to update the comment to make this clearer.

7533

I think it should not really matter as @ebrevnov said, but it being negative may yield interesting cases to check the cost-modeling.

7536

Changed to VecCOverVF.

7548–7552

Ah that was the one I was looking for! updated thanks!

7554

updated, thanks!

lebedev.ri accepted this revision.Sep 28 2021, 1:41 PM

Thank you, still looks great!

dmgreen added inline comments.Sep 29 2021, 1:49 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
447

unfortunately that's not possible with the currently available constructors. But I made it a required argument and updated the callers to avoid the extra constructor.

I was expecting it to use = ElementCount(), but this sounds OK too. Is the InnerLoopUnroller value deliberately 1, or would passing zero be better? I imagine it doesn't make much difference in practice.

7517

As far as I understand (correct me if I'm wrong!) we are essentially changing from code that looked like:

if (N < VF)
  goto scalar loop
if (!runtimechecks)
  goto scalar loop
vector loop; n -= VF
scalar loop

To the same code, but with a different initial guard value and potentially more runtime checks in places:

if (N < MinProfitableTripCount)
  goto scalar loop
if (!runtimechecks)
  goto scalar loop
vector loop; n -= VF
scalar loop

That means that if we under-estimate MinProfitableTripCount we go into the runtime checks/vector loop, potentially executing a lot of expensive runtime checks where it is not profitable.
If we _over_ estimate the MinProfitableTripCount then at runtime we will not execute the vector code, falling back to the scalar loop. So we have generated larger/less efficient scalar code that then never executes the vector part, even if it would be profitable to do so.

So we end up in the unfortunate place where either over or under estimating the cost can lead to inefficiencies.

I'm not too worried about the details here. They sound fine for the most part, so long as they are close enough. I'm more worried about the cost of the runtime checks being over-estimated due to them being unsimplified prior to costing. I think that is where the worst regressions I am seeing from this patch come from. Loops where vector code was previously generated and executed are now skipped over. Unfortunately, loops with lowish trip counts are common in a lot of code :)

The code in LoopVectorizationCostModel::isMoreProfitable already talks about the cost in terms of PerIterationCost*ceil(TripCount/VF) vs PerIterationCost*floor(TC/VF) though, and I would recommend describing things in the same way here, explaining that RtC + VecC * (TC / VF) + EpiC is a simplification of that.

That would be good, but unfortunately I think the epilogue vectorizer instantiation only has access to an ElementCount for now :( This can be threaded through as a follow-up.

There is no cost associated with runtime checks for epilogue vectorization. Thus we should simply initialize 'MinProfitableTripCount' with an 'unset' value in the Epilogue Vectorizer.

ebrevnov added inline comments.Oct 1 2021, 12:04 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
458

IMHO we'd better keep the semantics of 'MinProfitableTripCount' and not change its value to something not coming from profitability considerations. We can easily do the required adjustments at the use site(s).

fhahn updated this revision to Diff 380490.Oct 18 2021, 11:49 AM

Just a rebase after avoiding emitting the dummy AND x, true as part of the runtime checks in e844f05397b7.

The changes to llvm/test/Transforms/LoopVectorize/ARM/mve-qabs.ll are gone now.

Reverse ping - thanks!
I've mostly implemented interleaved load/store cost modelling for AVX2 (related D111460 is left)
since the original evaluation of this patch, so the effect this has may be different now.

Can we make it so that this code doesn't produce a umul_with_overflow if the step is 1?
https://github.com/llvm/llvm-project/blob/8e4c806ed5a481e4d2163c8330f3c3c024d61a36/llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp#L2501

(We may need to improve the costmodel for it under AArch64 too, but I've not looked into that quite yet. I'm not sure if @fhahn is planning anything like that either?)
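
A rough sketch of one way such a special case could look inside the expander (names like AbsStep, TCTrunc and B are placeholders, not the actual SCEVExpander code):

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"
#include <utility>
using namespace llvm;

// Compute |Step| * TC plus an overflow flag, but skip the umul_with_overflow
// when |Step| is the constant 1, since 1 * TC can never overflow.
static std::pair<Value *, Value *> mulStepByTC(IRBuilder<> &B, Value *AbsStep,
                                               Value *TCTrunc) {
  auto *StepC = dyn_cast<ConstantInt>(AbsStep);
  if (StepC && StepC->isOne())
    return {TCTrunc, ConstantInt::getFalse(B.getContext())};
  CallInst *Mul = B.CreateIntrinsic(Intrinsic::umul_with_overflow,
                                    {TCTrunc->getType()}, {AbsStep, TCTrunc});
  return {B.CreateExtractValue(Mul, 0, "mul.result"),
          B.CreateExtractValue(Mul, 1, "mul.overflow")};
}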

Can you point me at the test where that happens?

Hmm I don't know if there is a test. This should hopefully show it: https://godbolt.org/z/6Th4o1s5K

If you print the costs for the runtime checks, you can see they are unsimplified, with the umul being the largest part of the cost:

Cost of 0 for RTCheck   %4 = trunc i64 %0 to i32                                              
Cost of 10 for RTCheck   %mul31 = call { i32, i1 } @llvm.umul.with.overflow.i32(i32 1, i32 %4)
Cost of 0 for RTCheck   %mul.result = extractvalue { i32, i1 } %mul31, 0                      
Cost of 0 for RTCheck   %mul.overflow = extractvalue { i32, i1 } %mul31, 1                    
Cost of 1 for RTCheck   %5 = add i32 %2, %mul.result                                          
Cost of 1 for RTCheck   %6 = sub i32 %2, %mul.result                                          
Cost of 1 for RTCheck   %7 = icmp ugt i32 %6, %2                                              
Cost of 1 for RTCheck   %8 = icmp ult i32 %5, %2                                              
Cost of 1 for RTCheck   %9 = select i1 false, i1 %7, i1 %8                                    
Cost of 1 for RTCheck   %10 = icmp ugt i64 %0, 4294967295                                     
Cost of 1 for RTCheck   %11 = or i1 %9, %10                                                   
Cost of 1 for RTCheck   %12 = or i1 %11, %mul.overflow                                        
Cost of 1 for RTCheck   %13 = or i1 false, %12                                                
LV: Minimum required TC for runtime checks to be profitable:28

I'm not sure if they should be simplified by the builder during construction, simplified prior to costing, or whether the code to create them needs to be more precise.

So good and bad news. While the @llvm.umul.with.overflow case
was straight-forward (done in 156f10c840a0), there is still a significant number
of inefficiencies in the IR for these checks. I wasn't particularly looking forward to
arriving at the answer, but it is pretty obvious: if we really want to minimize
the estimated cost for these checks, we have to run instsimplify (or even instcombine)
on them first. The caveat here is that we first need to defuse SCEVExpanderCleaner,
because simplification will lead to dead instructions, and leaving them will again
lead to artificial cost. I feel like that is an improvement that is best done after this change itself, even though I'm not quite sure yet how to approach it.
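
For illustration, folding the trivially simplifiable check instructions before costing could look roughly like this (hypothetical names; the exact simplification entry point and the SCEVExpanderCleaner bookkeeping are exactly the open questions mentioned above):

#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/InstructionSimplify.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/DataLayout.h"
using namespace llvm;

// Replace instructions in the generated check block that fold away (e.g.
// 'or x, false' or 'umul_with_overflow(1, TC)') so they do not inflate the
// cost estimate. Operands that become dead still need separate clean-up.
static void foldTrivialCheckInsts(BasicBlock *CheckBlock, const DataLayout &DL) {
  SmallVector<Instruction *, 16> Worklist;
  for (Instruction &I : *CheckBlock)
    Worklist.push_back(&I);
  for (Instruction *I : Worklist)
    if (Value *V = simplifyInstruction(I, SimplifyQuery(DL))) {
      I->replaceAllUsesWith(V);
      if (I->use_empty())
        I->eraseFromParent();
    }
}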

haowei added a subscriber: haowei.Oct 27 2021, 3:36 PM

We are seeing these changes (ab1dbcecd6f0a to f3190dedeef9) break multiple Polly::CodeGen tests.
E.g. Polly :: CodeGen/aliasing_different_pointer_types.ll failed with the message:

Script:
--
: 'RUN: at line 1';   /b/s/w/ir/x/w/staging/llvm_build/bin/opt  -polly-process-unprofitable  -polly-remarks-minimal  -polly-use-llvm-names  -polly-import-jscop-dir=/b/s/w/ir/x/w/llvm-project/polly/test/CodeGen  -polly-codegen-verify  -polly-codegen -S < /b/s/w/ir/x/w/llvm-project/polly/test/CodeGen/aliasing_different_pointer_types.ll | /b/s/w/ir/x/w/staging/llvm_build/bin/FileCheck /b/s/w/ir/x/w/llvm-project/polly/test/CodeGen/aliasing_different_pointer_types.ll
--
Exit Code: 1

Command Output (stderr):
--
/b/s/w/ir/x/w/llvm-project/polly/test/CodeGen/aliasing_different_pointer_types.ll:18:10: error: CHECK: expected string not found in input
; CHECK: %[[orAndTrue:[._a-zA-Z0-9]]] = and i1 true, %[[le1OrLe2]]
         ^
<stdin>:20:19: note: scanning from here
 %6 = or i1 %2, %5
                  ^
<stdin>:20:19: note: with "le1OrLe2" equal to "6"
 %6 = or i1 %2, %5
                  ^
<stdin>:40:2: note: possible intended match here
 %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
 ^

Input file: <stdin>
Check file: /b/s/w/ir/x/w/llvm-project/polly/test/CodeGen/aliasing_different_pointer_types.ll

-dump-input=help explains the following input dump.

Input was:
<<<<<<
            .
            .
            .
           15:  %polly.access.A1 = getelementptr double*, double** %A, i64 1024 
           16:  %polly.access.B2 = getelementptr float*, float** %B, i64 0 
           17:  %3 = ptrtoint double** %polly.access.A1 to i64 
           18:  %4 = ptrtoint float** %polly.access.B2 to i64 
           19:  %5 = icmp ule i64 %3, %4 
           20:  %6 = or i1 %2, %5 
check:18'0                       X error: no match found
check:18'1                         with "le1OrLe2" equal to "6"
           21:  br i1 %6, label %polly.start, label %for.cond.pre_entry_bb 
check:18'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           22:  
check:18'0     ~
           23: for.cond.pre_entry_bb: ; preds = %polly.split_new_and_old 
check:18'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           24:  br label %for.cond 
check:18'0     ~~~~~~~~~~~~~~~~~~~~
           25:  
check:18'0     ~
            .
            .
            .
           35:  %arrayidx2 = getelementptr inbounds double*, double** %A, i64 %indvars.iv 
check:18'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           36:  store double* %tmp1, double** %arrayidx2, align 8 
check:18'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           37:  br label %for.inc 
check:18'0     ~~~~~~~~~~~~~~~~~~~
           38:  
check:18'0     ~
           39: for.inc: ; preds = %for.body 
check:18'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           40:  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1 
check:18'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
check:18'2      ?                                                  possible intended match
           41:  br label %for.cond 
check:18'0     ~~~~~~~~~~~~~~~~~~~~
           42:  
check:18'0     ~
           43: polly.merge_new_and_old: ; preds = %polly.exiting, %for.cond 
check:18'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
           44:  br label %for.end 
check:18'0     ~~~~~~~~~~~~~~~~~~~
           45:  
check:18'0     ~
            .
            .
            .
>>>>>>
--

failed build: https://ci.chromium.org/ui/p/fuchsia/builders/toolchain.ci/clang-linux-x64/b8832202202604530161/overview

Could you take a look? If it takes a long time to fix, could you revert your change first?

We are seeing these changes (ab1dbcecd6f0a to f3190dedeef9) break multiple Polly::CodeGen tests.

Fixed, thank you.

Reverse ping - thanks!
I've mostly implemented interleaved load/store cost modelling for AVX2 (related D111460 is left)
since the original evaluation of this patch, so the effect this has may be different now.

Can we make it so that this code doesn't produce a umul_with_overflow if the step is 1?
https://github.com/llvm/llvm-project/blob/8e4c806ed5a481e4d2163c8330f3c3c024d61a36/llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp#L2501

(We may need to improve the costmodel for it under AArch64 too, but I've not looked into that quite yet. I'm not sure if @fhahn is planning anything like that either?)

I just rebased the patches after some of the recent changes, but there are a ton of crashes now when cleaning up after the SCEVExpander. I need to investigate that first, before collecting a new set of numbers.

fhahn updated this revision to Diff 388715.Nov 20 2021, 10:23 AM

The latest rebase should resolve the crashes!

I also collected a first set of numbers for AArch64 with -O3 for Geekbench and SPEC2017. So far it looks like there is only one notable change: 544.nab_r regressed by 18%. So overall not too bad, but this one needs investigating. If anybody would be able to collect numbers for other platforms/benchmarks, that would also be extremely helpful.

fhahn updated this revision to Diff 397084.Jan 3 2022, 9:26 AM

Rebased, removed some code that became dead and also introduced a new cut-off on the number of runtime checks. This cutoff is there to only control compile-time. This fixes some excessive compile-time regressions, especially in mafft (was +15%).

While the cutoff is not ideal, I think we have to accept a bound on the number of runtime checks, because in degenerate cases vectorization can add a lot of additional code. For one case in mafft, vectorizing a loop with 325 runtime checks caused a 200% compile-time regression for that file. Note that the chosen cutoff is already higher than the previous pragma threshold.

Compile-time impact with this patch : http://llvm-compile-time-tracker.com/compare.php?from=05ce750b7968c548cf10a5f1413cf5aac3f1b083&to=f2093a608e33dc48dc2b8ebbbb1d0cd45b9bf6e7&stat=instructions

NewPM-O3: +0.25%
NewPM-ReleaseThinLTO: +0.18%
NewPM-ReleaseLTO-g: +0.18%

fhahn added a comment.Jan 5 2022, 1:31 PM

I think I tracked down the remaining regression to the SCEVExpander creating very bad code when expanding SCEV predicates for certain cases (like @dmgreen mentioned earlier). I'll put up a few small patches for SCEVExpander to improve things starting with D116696. After that, I think the patch should be good to go both in terms of compile-time and runtime perf.

Good to go now?

Herald added a project: Restricted Project. Apr 5 2022, 7:56 AM
fhahn updated this revision to Diff 420560.Apr 5 2022, 9:45 AM

Rebased

Good to go now?

Basically yes from my perspective, although we should wait for D122126 and maybe D119078 to land. We should definitely make sure that there's a substantial gap between this patch and D119078, because both have the potential to shake things up quite a bit.

fhahn updated this revision to Diff 432026.May 25 2022, 9:30 AM

Rebased on top of the other changes in this area (D122126, D119078). Those patches landed a while ago, so I think now would be a good time to move forward with this. Ping :)

dmgreen added inline comments.May 26 2022, 2:19 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1973

Would it be better to call CM.getInstructionCost here, or the base TTI.getInstructionCost?
It appears that the cost of a vscale is coming through as 10, where it should be 1. I'm not sure if any of the other processing done in CM.getInstructionCost is very useful for the scalar runtime checks?
I've added some quick tests (rG75631438e333) to show that the cost should be 1; it is just not treated as a "vectorizable intrinsic" by CM, so it is given the generic cost of a call.

1980

On a related note, can we get this to print the costs of each of the instructions in the runtime checks? It is useful for debugging when the numbers are incorrect. Otherwise at the moment I believe it just prints the final MinProfitableTripCount without any explanation of how it got there.
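
For illustration, a per-instruction cost dump over a generated check block could look roughly like this (a sketch with assumed names; the exact TTI hook, cost kind and output format are assumptions):

#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/Support/Debug.h"
#define DEBUG_TYPE "loop-vectorize"
using namespace llvm;

// Print the estimated cost of every instruction in a runtime-check block and
// return the accumulated total, so the final MinProfitableTripCount can be
// explained from the -debug output.
static InstructionCost dumpRTCheckCosts(BasicBlock *CheckBlock,
                                        const TargetTransformInfo &TTI) {
  InstructionCost Total = 0;
  for (Instruction &I : *CheckBlock) {
    InstructionCost C =
        TTI.getInstructionCost(&I, TargetTransformInfo::TCK_RecipThroughput);
    LLVM_DEBUG(dbgs() << "Cost of " << C << " for RTCheck " << I << "\n");
    Total += C;
  }
  return Total;
}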

10253

Is it worth making "100" a compiler option, so that it is not hardcoded? Could it reuse VectorizeMemoryCheckThreshold, even if it is a different unit?

fhahn updated this revision to Diff 432540.May 27 2022, 6:31 AM
fhahn marked an inline comment as done.

Address latest comments, thanks!

fhahn marked 2 inline comments as done.May 27 2022, 6:34 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1973

Would it be better to call CM.getInstructionCost here, or the base TTI.getInstructionCost?

Good point, as we are only interested in the scalar cost it should be sufficient to use TTI. Updated, thanks!

1980

I updated the code to print instruction costs here. An interesting result is that it seems our cost estimates for GEPs currently always assume they are used in instructions where the constant can be folded via the addressing mode. This is not true when using the pointers in compares as we do here. So the cost is slightly underestimated.

10253

Good point, I updated the code to re-use the option.

dmgreen accepted this revision.May 29 2022, 4:29 AM

Thanks. No further objections from me. LGTM

nikic added a comment.May 30 2022, 4:01 AM

The patch title and description could use an update. It currently sounds like this limits the amount of vectorization, while in practice this makes vectorization much more aggressive (right?)

Just to double check, now that an additional limit has been introduced, did the large code size regressions go away?

fhahn retitled this revision from [LV] Don't vectorize if we can prove RT + vector cost >= scalar cost. to [LV] Vectorize cases with larger number of RT checks, execute only if profitable..Jun 3 2022, 3:02 AM
fhahn edited the summary of this revision. (Show Details)
fhahn marked 2 inline comments as done.Jun 3 2022, 3:06 AM

The patch title and description could use an update. It currently sounds like this limits the amount of vectorization, while in practice this makes vectorization much more aggressive (right?)

Thanks, I just updated the title & description!

Just to double check, now that an additional limit has been introduced, did the large code size regressions go away?

Mostly yes. I updated the description with size numbers for AArch64 with -O3 (this should be the same configuration used for the earlier numbers). There are still a few increases, but much less severe ones than originally. Also, that's out of 237 binaries (SPEC2006, SPEC2017rate and MultiSource).

The patch should not be blocked by other patches, right? Or is there any other blocker?

fhahn updated this revision to Diff 436937.Jun 14 2022, 2:13 PM

The patch should not be blocked by other patches, right? Or is there any other blocker?

No, it should be all good to go now. I am planning on landing the earlier patch in the chain and then this one relatively soon.

fhahn updated this revision to Diff 442086.Jul 4 2022, 6:54 AM

Another rebase just before landing.

This revision was landed with ongoing or failed builds.Jul 4 2022, 7:12 AM
This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Jul 4 2022, 8:39 AM

Looks like this breaks check-clang: http://45.33.8.238/linux/80254/step_7.txt

fhahn added a comment.Jul 4 2022, 9:25 AM

Looks like this breaks check-clang: http://45.33.8.238/linux/80254/step_7.txt

Thanks, should be fixed by 9eb65727861166543

asmok-g added a subscriber: asmok-g.Jul 7 2022, 9:30 AM

Heads-up: I think this patch caused a mis-compile that's causing a test in TensorFlow to fail. We're still confirming it and working on a reproducer.

alexfh added a subscriber: alexfh.Jul 7 2022, 9:48 AM

Heads-up: I think this patch caused a mis-compile that's causing a test in TensorFlow to fail. We're still confirming it and working on a reproducer.

I got it down to this sample:

void *memmove(void * destination, const void * source, unsigned long num);
void f(char *s, char *d, int g) {
  while (g--) {
    memmove(d, s, 4);
    d += 4;
  }
}

When compiled with --target=x86_64--linux-gnu -O2, before and after this commit, the resulting assembly differs in a way that seems wrong to me:

@@ -1,100 +1,100 @@
        .text
        .file   "input.i"
        .globl  f                               # -- Begin function f
        .p2align        4, 0x90
        .type   f,@function
 f:                                      # @f
        .cfi_startproc
 # %bb.0:
                                         # kill: def $edx killed $edx def $rdx
        testl   %edx, %edx
        je      .LBB0_16
 # %bb.1:
        leal    -1(%rdx), %r8d
-       cmpl    $7, %r8d
+       cmpl    $15, %r8d
        jb      .LBB0_2
 # %bb.3:
        leaq    4(%rdi), %rax
        cmpq    %rsi, %rax
        jbe     .LBB0_6
 # %bb.4:
        leaq    (%rsi,%r8,4), %rax
        addq    $4, %rax
        cmpq    %rdi, %rax
        jbe     .LBB0_6
 .LBB0_2:
        movq    %rsi, %rax
 .LBB0_9:
        leal    -1(%rdx), %r8d
        testb   $7, %dl
        je      .LBB0_13
 # %bb.10:
        movl    %edx, %r9d
        andl    $7, %r9d
        xorl    %esi, %esi
        .p2align        4, 0x90
 .LBB0_11:                               # =>This Inner Loop Header: Depth=1
        movl    (%rdi), %ecx
        movl    %ecx, (%rax)
        addq    $4, %rax
        incq    %rsi
        cmpl    %esi, %r9d
        jne     .LBB0_11
 # %bb.12:
        subl    %esi, %edx
 .LBB0_13:
        cmpl    $7, %r8d
        jb      .LBB0_16
 # %bb.14:
        movl    %edx, %ecx
        xorl    %edx, %edx
        .p2align        4, 0x90
 .LBB0_15:                               # =>This Inner Loop Header: Depth=1
        movl    (%rdi), %esi
        movl    %esi, (%rax,%rdx,4)
        movl    (%rdi), %esi
        movl    %esi, 4(%rax,%rdx,4)
        movl    (%rdi), %esi
        movl    %esi, 8(%rax,%rdx,4)
        movl    (%rdi), %esi
        movl    %esi, 12(%rax,%rdx,4)
        movl    (%rdi), %esi
        movl    %esi, 16(%rax,%rdx,4)
        movl    (%rdi), %esi
        movl    %esi, 20(%rax,%rdx,4)
        movl    (%rdi), %esi
        movl    %esi, 24(%rax,%rdx,4)
        movl    (%rdi), %esi
        movl    %esi, 28(%rax,%rdx,4)
        addq    $8, %rdx
        cmpl    %edx, %ecx
        jne     .LBB0_15
        jmp     .LBB0_16
 .LBB0_6:
        incq    %r8
        movq    %r8, %r9
        andq    $-8, %r9
        subl    %r9d, %edx
        leaq    (%rsi,%r9,4), %rax
        movd    (%rdi), %xmm0                   # xmm0 = mem[0],zero,zero,zero
        pshufd  $0, %xmm0, %xmm0                # xmm0 = xmm0[0,0,0,0]
        xorl    %ecx, %ecx
        .p2align        4, 0x90
 .LBB0_7:                                # =>This Inner Loop Header: Depth=1
        movdqu  %xmm0, (%rsi,%rcx,4)
        movdqu  %xmm0, 16(%rsi,%rcx,4)
        addq    $8, %rcx
        cmpq    %rcx, %r9
        jne     .LBB0_7
 # %bb.8:
        cmpq    %r9, %r8
        jne     .LBB0_9
 .LBB0_16:
        retq
 .Lfunc_end0:
        .size   f, .Lfunc_end0-f
        .cfi_endproc
                                         # -- End function
-       .ident  "clang version google3-trunk (aa78c5298ea37f2ca8150dc0a1c880be7ec438f4)"
+       .ident  "clang version google3-trunk (644a965c1efef68f22d9495e4cefbb599c214788)"
        .section        ".note.GNU-stack","",@progbits
        .addrsig
alexfh added a comment.Jul 7 2022, 9:55 AM

When compiled with --target=x86_64--linux-gnu -O2, before and after this commit, the resulting assembly differs in a way that seems wrong to me:

After reading the description of the commit I'm not sure about this being wrong, but this sort of change has definitely caused a difference in the behavior of some numpy C code used by the TensorFlow test @asmok-g mentioned.

fhahn added a comment.Jul 7 2022, 2:16 PM

When compiled with --target=x86_64--linux-gnu -O2, before and after this commit, the resulting assembly differs in a way that seems wrong to me:

After reading the description of the commit I'm not sure about this being wrong, but this sort of change has definitely caused a difference in the behavior of some numpy C code used by the TensorFlow test @asmok-g mentioned.

I had a look at the example, but I don't think the patch is at fault here directly. The only difference for the example is that the vector loop is only executed if the loop executes 16 or more iterations vs 8 or more before. This shouldn't impact correctness, unless the code path for the scalar loop is mis-compiled.

Here's the only IR change:

<   %min.iters.check = icmp ult i32 %0, 7
---
>   %min.iters.check = icmp ult i32 %0, 15

Is it possible that the reproducer has been reduced too far? Does it work as expected if vectorization is disabled for the loop via #pragma clang loop vectorize(disable) / #pragma clang loop interleave(disable)?

Is it possible that the reproducer has been reduced too far?

Yes, this was the case. And when I inspected the original code closer, I found a problem with the C code and a problem with the test using it. Thus, false alarm here.

alexfh added a comment.Jul 8 2022, 3:09 AM

By the way, this commit seems to regress the llvm_singlesource / Misc_oourafft benchmark by ~8% on multiple configurations. I'm not sure what the right tradeoff here is and how representative the benchmark is of real-life workloads, but you might want to look at this.

alexfh added a comment.Jul 8 2022, 3:30 AM

By "multiple configurations" I meant multiple microarchitectures and optimization modes (-O3 and FDO).

Hi @fhahn, I believe this patch breaks the AArch64/SVE buildbots. You can see the effect on https://lab.llvm.org/buildbot/#/builders/197/builds/2185. It looks like we were unlucky with the buildbot run that actually contained your patch, because it failed for what looks like a temporary CI issue, but all the runs after this point show functional regressions when running LNT. I've manually checked your commit and the one before yours to confirm this is when things start to fail.

fhahn added a comment.Jul 14 2022, 4:51 PM

Hi @fhahn, I believe this patch breaks the AArch64/SVE buildbots. You can see the effect on https://lab.llvm.org/buildbot/#/builders/197/builds/2185. It looks like we were unlucky with the buildbot run that actually contained your patch, because it failed for what looks like a temporary CI issue, but all the runs after this point show functional regressions when running LNT. I've manually checked your commit and the one before yours to confirm this is when things start to fail.

Thanks for the heads-up! Would it be possible to share a preprocessed file that gets miscompiled or an LLVM IR function that shows the issue? Otherwise I won't be able to reproduce/investigate the failure, as I do not have access to SVE hardware.

The issue is easily reproduced via user mode qemu (in my case I did this on x86) if that helps? Otherwise I can see about getting a reproducer but that'll take some time so can you revert the patch in the interim?

I think I know what the issue is. I'll prepare a fix on Friday; if it takes longer I'll revert the change.

fhahn added a comment.Jul 16 2022, 9:21 AM

Should be fixed by aa00fb02c98a. The bot is back to green.