This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add AArch64TTIImpl::getMaskedMemoryOpCost function
ClosedPublic

Authored by david-arm on Apr 19 2021, 1:58 AM.

Details

Summary

When vectorising for AArch64 targets, specifying the SVE attribute
automatically causes masked loads and stores to be treated as legal.
Also, since we have no cost model for masked memory ops, we assume it
is cheap to use the masked load/store intrinsics even for fixed-width
vectors. This can lead to poor code quality, because the intrinsics
are currently scalarised in the backend. This patch adds a basic
cost model that marks fixed-width masked memory ops as significantly
more expensive than their scalable-vector counterparts.

Tests for the cost model are added here:

Transforms/LoopVectorize/AArch64/masked-op-cost.ll
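
As a rough illustration of the approach (a sketch only: the structure and the
constants below are assumptions for illustration, not the exact code from the
diff):

// Sketch of the new hook in AArch64TargetTransformInfo.cpp.
InstructionCost
AArch64TTIImpl::getMaskedMemoryOpCost(unsigned Opcode, Type *Src,
                                      Align Alignment, unsigned AddressSpace,
                                      TTI::TargetCostKind CostKind) {
  auto *VTy = cast<VectorType>(Src);
  if (!isa<ScalableVectorType>(VTy))
    // Fixed-width masked loads/stores are currently scalarised in the
    // backend, so make them look expensive to the vectoriser.
    // (Illustrative per-lane factor, not the value used by the patch.)
    return VTy->getElementCount().getKnownMinValue() * 8;
  // SVE has native masked load/store instructions, so keep these cheap.
  return 2;
}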

Diff Detail

Event Timeline

david-arm created this revision.Apr 19 2021, 1:58 AM
david-arm requested review of this revision.Apr 19 2021, 1:58 AM
Herald added a project: Restricted Project.Apr 19 2021, 1:58 AM
CarolineConcatto accepted this revision.Apr 20 2021, 1:44 AM

@david-arm I believe this patch is ok.
If it is not possible to make masked stores illegal for fixed vectors using isLegalMaskedLoadStore, then I believe the cost model is another valid way to avoid them for fixed vectors.
Thank you for the explanation earlier.
I am going to approve it, and let's hope that this will not be a curse like it was with Sander's cost model patch.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
921

Hey David, have you used clang-format, or is this Phabricator's formatting?

This revision is now accepted and ready to land.Apr 20 2021, 1:44 AM
junparser added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
927

would it be better to also consider useSVEForFixedLengthVectors here?

david-arm added inline comments.Apr 20 2021, 1:53 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
927

Yes, definitely, once we have support for lowering masked loads/stores for fixed-width vectors using SVE. At the moment, though, we still scalarise masked loads/stores.

I believe there is work in progress on lowering fixed-width masked loads/stores to use SVE, so once that patch lands we can update this cost model too.

sdesmalen added inline comments.Apr 20 2021, 1:53 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
928

Should this actually be something to add to BasicTTIImpl.h so that it can be reused for other targets? The cost of implementing a masked memory op would be: NumElts * (cost(load element) + cost(insert element)) <=> NumElts * cost(load element) + ScalarizationOverhead

Then we only have to implement this function for the scalable case, and all other cases can call the BasicTTIImpl.
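
A rough sketch of that formula in TTI terms (the helper name and the exact
calls here are illustrative assumptions, not code from the patch):

static InstructionCost
scalarizedMaskedMemOpCost(TargetTransformInfo &TTI, unsigned Opcode,
                          FixedVectorType *VTy, Align Alignment,
                          unsigned AddrSpace,
                          TargetTransformInfo::TargetCostKind CostKind) {
  unsigned NumElts = VTy->getNumElements();
  // NumElts * cost(scalar load/store of one element) ...
  InstructionCost Cost =
      TTI.getMemoryOpCost(Opcode, VTy->getElementType(), Alignment, AddrSpace,
                          CostKind) *
      NumElts;
  // ... plus the cost of building up (for loads) or taking apart (for stores)
  // the vector, i.e. the scalarization overhead.
  Cost += TTI.getScalarizationOverhead(VTy, APInt::getAllOnesValue(NumElts),
                                       /*Insert=*/Opcode == Instruction::Load,
                                       /*Extract=*/Opcode == Instruction::Store);
  return Cost;
}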

david-arm added inline comments.Apr 20 2021, 1:58 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
928

I don't think BasicTTIImpl currently has getMaskedMemoryOpCost, but I can look into adding one if we think it's worthwhile? If so, we'd almost certainly want to update the ARM target too, because it does something very similar.

Does that cost you mention above take into account the compare and branch? I'd expect something like:

%1 = icmp
br i1 %1, ...
%2 = load
...
%3 = insertelement ... %2

I think it's important to reflect the cost of the branch here, since that's something the vector version wouldn't have.
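
Sketched against the same hypothetical helper as above, the extra per-lane
terms might look roughly like this (assumed API use, not the patch itself):

static InstructionCost scalarizedMaskedMemOpCostWithBranches(
    TargetTransformInfo &TTI, unsigned Opcode, FixedVectorType *VTy,
    Align Alignment, unsigned AddrSpace,
    TargetTransformInfo::TargetCostKind CostKind) {
  // Start from the load/store + insert/extract estimate sketched earlier.
  InstructionCost Cost = scalarizedMaskedMemOpCost(TTI, Opcode, VTy, Alignment,
                                                   AddrSpace, CostKind);
  unsigned NumElts = VTy->getNumElements();
  auto *MaskTy =
      FixedVectorType::get(Type::getInt1Ty(VTy->getContext()), NumElts);
  for (unsigned I = 0; I != NumElts; ++I) {
    // Each lane is guarded: extract the i1 mask bit, then branch on it.
    Cost += TTI.getVectorInstrCost(Instruction::ExtractElement, MaskTy, I);
    Cost += TTI.getCFInstrCost(Instruction::Br, CostKind);
  }
  return Cost;
}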

david-arm updated this revision to Diff 339163.Apr 21 2021, 3:20 AM
  • Added BasicTTIImpl implementation of getMaskedMemoryOpCost.
  • Updated ARM and AArch64 backends to use the new BaseT version.
david-arm marked an inline comment as done.Apr 21 2021, 3:20 AM
sdesmalen added inline comments.Apr 21 2021, 4:34 AM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
1026 ↗(On Diff #339163)

this name is really confusing, because I initially thought it was the same as getMaskedMemoryOpCost.

How about getCommonMaskedMemoryOpCost? And adding a comment that it is not the same as getMaskedMemoryOpCost. Perhaps you can also make this method private, as it's not supposed to be exposed outside this class.

1031 ↗(On Diff #339163)

How did this line move here?

1036 ↗(On Diff #339163)

perhaps just personal preference, but I think this reads easier:

InstructionCost AddrExtractCost = IsGatherScatter ? getVectorInstrCost(...) : 0;
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
922

A cost of '2' would only be valid for legal types. It probably needs to consider legalisation, so that the cost of a <vscale x 4 x i64> would be 4.
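
In code, the legalisation-aware version might look roughly like this (a sketch
assuming getTypeLegalizationCost is used inside the AArch64 hook; not
necessarily the code that landed):

// Inside AArch64TTIImpl::getMaskedMemoryOpCost.
if (isa<ScalableVectorType>(Src)) {
  // Let type legalisation split over-wide types first, then charge per legal
  // part, so <vscale x 4 x i64> (two <vscale x 2 x i64> parts) costs 2 * 2 = 4.
  auto LT = TLI->getTypeLegalizationCost(DL, Src);
  return LT.first * 2;
}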

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1441 ↗(On Diff #339163)

Why have you changed the ARM cost-model?

david-arm added inline comments.Apr 21 2021, 4:55 AM
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1441 ↗(On Diff #339163)

Sorry, I just assumed this was in line with what you were suggesting before about having a default cost model. I believe ARM currently only does this because there was previously no BasicTTIImpl version, and I was hoping that the BasicTTIImpl version would be an improvement on the previous guess of 8 * NumElements.

Also, this function never gets called for scalars so it seemed a bit odd to explicitly discriminate between vectors and scalars, and perhaps made more sense to just always call the BaseT version?

Also, why is isLegalMaskedLoad returning true if there are not any legal masked loads for that type?
(Not that adding these extra costs isn't a good thing, it's good to have a better default than just 1, and a good cost is useful in many places).

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1446 ↗(On Diff #339163)

Br and PHI are often free, but were accounted for here. I think the old code might have been fine, and more accurate for arm.

Also, why is isLegalMaskedLoad returning true if there are not any legal masked loads for that type?
(Not that adding these extra costs isn't a good thing, it's good to have a better default than just 1, and a good cost is useful in many places).

This is because as soon as you enable SVE you effectively switch on masked loads and stores. The vectoriser only calls isLegalMaskedLoad with an element type, not a vector type. This means that we can't distinguish between fixed width and scalable vectors.
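
Roughly the behaviour being described, reduced to a sketch (heavily simplified;
not a verbatim copy of the AArch64 hook):

// The legality query only sees an element type when the vectoriser calls it,
// so with SVE enabled it cannot tell fixed-width users from scalable ones and
// simply answers "legal".
bool isLegalMaskedLoadStore(Type *DataType, Align Alignment) {
  return ST->hasSVE();
}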

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1446 ↗(On Diff #339163)

OK, sure, I'll revert it then. I'm not sure the BasicTTIImpl version is that accurate for AArch64 either, because we treat branches as zero cost for some reason. The i1 vector extract cost is probably too low as well.

This is because as soon as you enable SVE you effectively switch on masked loads and stores. The vectoriser only calls isLegalMaskedLoad with an element type, not a vector type. This means that we can't distinguish between fixed width and scalable vectors.

OK, that makes sense. It won't know the vector width until later.

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1446 ↗(On Diff #339163)

I think the reasoning is that unconditional branches are often zero cost in modern CPUs, in terms of throughput/latency. Conditional branches depend on the branch predictor, and the number of branches from a scalarized intrinsic can start to break that.

It may be worth adding a few llvm.masked.store/llvm.masked.load cost checks for AArch64, if we don't have them already, to show the costs more clearly.

david-arm updated this revision to Diff 339672.Apr 22 2021, 9:14 AM
  • Reverted ARM backend changes.
  • Added legalisation cost to AArch64 cost function.
  • Addressed other review comments.
david-arm marked 5 inline comments as done.Apr 22 2021, 9:14 AM
sdesmalen accepted this revision.Apr 23 2021, 7:17 AM

Thanks for the updates to the patch; this change looks fine to me now!

This revision was landed with ongoing or failed builds.Apr 26 2021, 3:00 AM
This revision was automatically updated to reflect the committed changes.
Matt added a subscriber: Matt.May 4 2021, 9:59 AM