This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Change the cost of vector insert/extract to 2
ClosedPublic

Authored by dmgreen on Jul 17 2023, 6:49 AM.

Details

Summary

The cost of vector instructions has always been high under AArch64, in order to add a high cost for inserts/extracts, shuffles and scalarization. This is a conservative approach to limit the scope of "unusual" SLP vectorization where the codegen ends up being quite poor, but has always been higher than the "correct" costs would be for any specific core.

This relaxes that, reducing the vector insert/extract cost from 3 to 2. It is a generalization of D142359 to all AArch64 cpus. The ScalarizationOverhead is also overridden for integer vectors at the same time, to remove the effect of lane 0 being considered free (something that should only be true for floats when scalarizing).

The lower insert/extract cost will reduce the cost of inserts, extracts, shuffling and scalarization. The adjustment of ScalarizationOverhead will increase the cost for integers, especially for small vectors. The end result will be lower costs for float and long-integer types, and somewhat higher costs for some smaller integer vectors. This, along with the raw insert/extract cost being lower, will generally mean more vectorization from the Loop and SLP vectorizers.

We may learn to regret this, as that vectorization is not always profitable. In all the benchmarking I have done, this is generally an improvement in overall performance, and I've attempted to address the places where it wasn't with other cost model adjustments.

Diff Detail

Event Timeline

dmgreen created this revision.Jul 17 2023, 6:49 AM
dmgreen requested review of this revision.Jul 17 2023, 6:49 AM
Herald added a project: Restricted Project. Jul 17 2023, 6:49 AM
Herald added a subscriber: wangpc.
dmgreen edited the summary of this revision. Jul 17 2023, 7:03 AM
SjoerdMeijer added inline comments.Jul 17 2023, 7:21 AM
llvm/test/Analysis/CostModel/AArch64/bswap.ll
13–14

Just a bit of a drive by question first. It's not really caused by this change, I think, but it looks like the cost modelling was already a bit off for these bswaps?

https://godbolt.org/z/d1s4ToP1G

Or am I missing something?

peterwaller-arm resigned from this revision.Jul 17 2023, 7:34 AM

I'm not currently the right person to review this, resigning.

Noted a couple of regressions on test cases which explicitly say they're not supposed to be vectorized.

A lot of the numbers for intrinsics are pretty clearly off by a very large amount. If nobody is going to look at them, maybe we should just kill off the tests in question so reviewers don't have to read meaningless updates to them?

llvm/test/Analysis/CostModel/AArch64/fshr.ll
183

Weird cost modeling.

llvm/test/Analysis/CostModel/AArch64/getIntrinsicInstrCost-vector-reverse.ll
25

Weird cost modeling.

llvm/test/Analysis/CostModel/AArch64/shuffle-select.ll
35

Weird cost modeling.

llvm/test/Analysis/CostModel/AArch64/vector-select.ll
692

Cost modeling is weird.

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll
52–53

This cost modeling is weird.

llvm/test/Transforms/SLPVectorizer/AArch64/slp-fma-loss.ll
204

Regression?

llvm/test/Transforms/VectorCombine/AArch64/load-extractelement-scalarization.ll
521

Regression?

A lot of the numbers for intrinsics are pretty clearly off by a very large amount. If nobody is going to look at them, maybe we should just kill off the tests in question so reviewers don't have to read meaningless updates to them?

A lot of them are at least roughly correct even if they look odd, and it is good to have the test coverage.

llvm/test/Analysis/CostModel/AArch64/bswap.ll
13–14

I don't know if anyone has looked into getting vector bswap costs correct in the past, just scalar. These will just be costed as if they were scalarized, so the changes here are really just saying that the scalarization overhead is changing a little.

I can look into improving some of them, but like you say it's a bit unrelated to this change.

llvm/test/Analysis/CostModel/AArch64/fshr.ll
183

Yep I think we have only looked at cost modelling for constant funnel shifts, and those were added fairly recently. I believe the codegen should also be improved for the variable case.

llvm/test/Analysis/CostModel/AArch64/getIntrinsicInstrCost-vector-reverse.ll
25

I agree, but the codegen looks odd without +bf16 too: https://godbolt.org/z/oTG5ae6nP

llvm/test/Analysis/CostModel/AArch64/shuffle-select.ll
35

Do you mean because of the tbl? We have never costed tbls as cheap. I'm not sure if that would be profitable or not, and feels very much like a different issue.

llvm/test/Analysis/CostModel/AArch64/vector-select.ll
692

Because it is too low? It is scalarized without +fullfp16. That codegen could be better, and it looks like the cost is a bit low, not accounting for the scalarization cost of the extracts. I don't think we have focussed much in the past on the combination of fp16 code without fullfp16.

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll
52–53

This is an in-order reduction.

llvm/test/Transforms/SLPVectorizer/AArch64/slp-fma-loss.ll
204

I had looked into these. This test case was added without the underlying issue being fixed (fusing fmul and fadd). The tests were changed by the increased cost of ld1r instructions.

In this case it is just profitable again now. You can see it picks 2x vectorization though, not 4x, which seems to be because of the insertelement <2 x float> <float poison, float 3.000000e+00>, float [[X:%.*]], i32 0, which counts as the cost of a constant vector with x inserted into the bottom lane, both of which are incorrectly counted as zero.

llvm/test/Transforms/VectorCombine/AArch64/load-extractelement-scalarization.ll
521

The cost of an extract of lane zero is still 0, which is known to be wrong but doesn't look like something we can change without causing too many regressions. I was really hoping to remove it for integer types at the same time as this, but it looks like it causes too many problems. I'm hoping that can be improved in the future, which will hopefully be easier now that the base scalar cost is lower.

This seems to already be handled by instruction selection https://godbolt.org/z/7GEcxo8WT, so it shouldn't be a problem on its own. I can change the test to use lane 1 to show it still applies for other lanes.

efriedma added inline comments.Jul 19 2023, 1:24 AM
llvm/test/Analysis/CostModel/AArch64/getIntrinsicInstrCost-vector-reverse.ll
25

Still way overestimating the cost... but yes.

llvm/test/Analysis/CostModel/AArch64/shuffle-select.ll
35

We should be modeling the fact that tbl exists, at least. (I mean, it doesn't need to be super-cheap, but basically all ARM chips have a reasonably fast tbl.)

llvm/test/Analysis/CostModel/AArch64/vector-select.ll
692

Wait, we scalarize this? I thought I checked this, but must not have. We really shouldn't scalarize, though.

llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll
52–53

Then why does it cost 1 at VF 4?

llvm/test/Transforms/VectorCombine/AArch64/load-extractelement-scalarization.ll
521

Okay.

SjoerdMeijer added inline comments.Jul 20 2023, 5:50 AM
llvm/test/Analysis/CostModel/AArch64/vector-select.ll
692

Without fullfp16 support, which is what this is checking with "COST-NOFP16-NEXT", I expect this to get scalarised.

dmgreen added inline comments.Jul 27 2023, 1:22 AM
llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll
52–53

It, for some reason, prints the costs twice, and these match two different things. There are some details in 2e0bf67df1437cb0156d7f5dd9e1b701749f96ca. I'll rebase on the updated costs.

dmgreen updated this revision to Diff 544649.Jul 27 2023, 1:29 AM

Rebase over adjusted costs.

efriedma accepted this revision.Jul 27 2023, 11:10 AM

LGTM. I'll trust your benchmarking on this.

llvm/test/Analysis/CostModel/AArch64/cttz.ll
108

Worth noting the actual cost here is 4 instructions. (We don't scalarize it; we lower using cnt.)

This revision is now accepted and ready to land.Jul 27 2023, 11:10 AM

Thanks.

This has a fairly high chance of causing some problems somewhere. Please send reports of any regressions and I can see about addressing them.

This revision was landed with ongoing or failed builds.Jul 28 2023, 1:27 PM
This revision was automatically updated to reflect the committed changes.