This is an archive of the discontinued LLVM Phabricator instance.

Differential D100745

[AArch64] Add AArch64TTIImpl::getMaskedMemoryOpCost function
ClosedPublic

Authored by david-arm on Apr 19 2021, 1:58 AM.

Details

Reviewers

sdesmalen
CarolineConcatto
dmgreen
peterwaller-arm

Commits

rGa458b7855e1a: [AArch64] Add AArch64TTIImpl::getMaskedMemoryOpCost function

Summary

When vectorising for AArch64 targets with the SVE attribute specified,
we automatically treat masked loads and stores as legal. Because we
have no cost model for masked memory ops, we also assume it is cheap
to use the masked load/store intrinsics, even for fixed-width vectors.
This can lead to poor code quality, as the intrinsics are currently
scalarised in the backend. This patch adds a basic cost model that
marks fixed-width masked memory ops as significantly more expensive
than those for scalable vectors.

Tests for the cost model are added here:

Transforms/LoopVectorize/AArch64/masked-op-cost.ll
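
The cost model described in the summary can be sketched outside of LLVM as follows. This is only an illustration of the first revision of the patch: `VecDesc`, `maskedMemoryOpCost` and the two constants are hypothetical names, and the sketch ignores the non-vector case that the real function also handles.

```cpp
#include <cassert>

// Hypothetical stand-in for an LLVM vector type; it carries only the two
// properties that the cost model looks at.
struct VecDesc {
  unsigned NumElts; // element count (the minimum count when scalable)
  bool IsScalable;  // true for <vscale x N x T>
};

// Constants taken from the first revision of the patch (Diff 338448).
constexpr unsigned ScalarisedLaneCost = 8;   // compare + branch + memory op
constexpr unsigned ScalableMaskedOpCost = 2; // one predicated SVE memory op

// Fixed-width masked ops are scalarised, so they are charged per lane.
// Scalable ones map onto SVE and get a small flat cost.
constexpr unsigned maskedMemoryOpCost(VecDesc Ty) {
  return Ty.IsScalable ? ScalableMaskedOpCost
                       : ScalarisedLaneCost * Ty.NumElts;
}

// These match the CHECK-COST lines in masked-op-cost.ll.
static_assert(maskedMemoryOpCost(VecDesc{2, false}) == 16, "fixed VF 2");
static_assert(maskedMemoryOpCost(VecDesc{4, false}) == 32, "fixed VF 4");
static_assert(maskedMemoryOpCost(VecDesc{4, true}) == 2, "vscale x 4");
```

With per-lane costs this high, the vectoriser has no reason to prefer a fixed-width VF, which is what the `Selecting VF: 1` check in the test relies on.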

Event Timeline

david-arm created this revision. Apr 19 2021, 1:58 AM

Herald added subscribers: danielkiss, hiraditya, kristof.beyls. Apr 19 2021, 1:58 AM

david-arm requested review of this revision. Apr 19 2021, 1:58 AM

Herald added a project: Restricted Project. Apr 19 2021, 1:58 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B99434: Diff 338448. Apr 19 2021, 2:39 AM

@david-arm I believe this patch is ok.
If it is not possible to make masked stores illegal for fixed vector using isLegalMaskedLoadStore, then I believe that the cost model is another valid solution to avoid it for fixed vectors.
Thank you for the explanation earlier.
I am going to approve and let's hope that this will not be a curse like in Sander's cost model patch.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
921	Hey David, have you used clang-format, or is it Phabricator?

This revision is now accepted and ready to land. Apr 20 2021, 1:44 AM

junparser added a subscriber: junparser. Apr 20 2021, 1:47 AM

junparser added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
927	Would it be better to also consider useSVEForFixedLengthVectors here?

david-arm added inline comments. Apr 20 2021, 1:53 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
927	Yes definitely once we have support for lowering masked loads/stores using SVE for fixed width vectors. At the moment though we still continue to scalarise masked loads/stores. There is work in progress I believe on lowering fixed width masked loads/stores to use SVE so once that patch lands we can update this cost model too.

sdesmalen added inline comments. Apr 20 2021, 1:53 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
928	Should this actually be something to add to BasicTTIImpl.h so that it can be reused for other targets? The cost of implementing a masked memory op would be: `NumElts * (cost(load element) + cost(insert element)) <=> NumElts * cost(load element) + ScalarizationOverhead` Then we only have to implement this function for the scalable case, and all other cases can call the BasicTTIImpl.

david-arm added inline comments. Apr 20 2021, 1:58 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
928	I don't think BasicTTIImpl currently has getMaskedMemoryOpCost, but I can look into adding one if we think it's worthwhile? If so, we'd almost certainly want to update the ARM target too, because it does something very similar. Does that cost you mention above take into account the compare and branch? I'd expect something like:

  %1 = icmp
  br i1 %1, ...
  %2 = load ...
  %3 = insertelement .... %2

I think it's important to reflect the cost of the branch here, since that's something the vector version wouldn't have.
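
The two views in this exchange can be put side by side in a small sketch. Everything here is hypothetical: `UnitCosts` and the function names are made up for illustration, and the unit costs are placeholders, not values any target reports.

```cpp
#include <cassert>

// Hypothetical per-instruction costs for one lane of a scalarised masked
// load. Real values would come from the target's cost model.
struct UnitCosts {
  unsigned MemOp;   // scalar load of one element
  unsigned Insert;  // insertelement of the loaded value
  unsigned Compare; // test of one mask bit
  unsigned Branch;  // conditional branch around the load
};

// The insert/extract work that scalarisation adds on top of the memory ops.
constexpr unsigned scalarisationOverhead(unsigned NumElts, UnitCosts C) {
  return NumElts * C.Insert;
}

// NumElts * (cost(load element) + cost(insert element)), written in the
// equivalent "memory ops + ScalarizationOverhead" form.
constexpr unsigned scalarisedCost(unsigned NumElts, UnitCosts C) {
  return NumElts * C.MemOp + scalarisationOverhead(NumElts, C);
}

// The same cost with the per-lane compare and branch included.
constexpr unsigned scalarisedCostWithControlFlow(unsigned NumElts,
                                                 UnitCosts C) {
  return scalarisedCost(NumElts, C) + NumElts * (C.Compare + C.Branch);
}

static_assert(scalarisedCost(4, UnitCosts{1, 1, 1, 1}) == 8, "");
static_assert(scalarisedCostWithControlFlow(4, UnitCosts{1, 1, 1, 1}) == 16,
              "");
```

Under these placeholder costs the control flow doubles the estimate, which is the gap the reply is pointing at.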

Added BasicTTIImpl implementation of getMaskedMemoryOpCost.
Updated ARM and AArch64 backends to use the new BaseT version.

david-arm marked an inline comment as done. Apr 21 2021, 3:20 AM

Harbormaster completed remote builds in B99930: Diff 339163. Apr 21 2021, 3:54 AM

sdesmalen added inline comments. Apr 21 2021, 4:34 AM

llvm/include/llvm/CodeGen/BasicTTIImpl.h
1026 ↗	(On Diff #339163)	this name is really confusing, because I initially thought it was the same as getMaskedMemoryOpCost. How about `getCommonMaskedMemoryOpCost`? And adding a comment that it is not the same as getMaskedMemoryOpCost. Perhaps you can also make this method `private`, as it's not supposed to be exposed outside this class.
1031 ↗	(On Diff #339163)	How did this line move here?
1036 ↗	(On Diff #339163)	Perhaps just personal preference, but I think this reads easier: InstructionCost AddrExtractCost = IsGatherScatter ? getVectorInstrCost(...) : 0;
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
922	A cost of '2' would only be valid for legal types. It probably needs to consider legalisation, so that the cost of a <vscale x 4 x i64> would be 4.
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1441 ↗	(On Diff #339163)	Why have you changed the ARM cost-model?
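
The legalisation point in the comment on line 922 can be illustrated with a small calculation. This is a rough sketch, not how LLVM computes it: `numLegalParts` and `scalableMaskedOpCost` are hypothetical helpers that assume a legal SVE register holds 128 bits per vscale unit and that wider types are split into equal parts.

```cpp
#include <cassert>

// Assumption: one legal SVE register covers 128 bits per vscale unit.
constexpr unsigned SVEBitsPerBlock = 128;

// Number of legal registers needed for <vscale x MinElts x iEltBits>.
// Types narrower than one block still occupy one register.
constexpr unsigned numLegalParts(unsigned MinElts, unsigned EltBits) {
  return (MinElts * EltBits + SVEBitsPerBlock - 1) / SVEBitsPerBlock;
}

// Charge the base cost of 2 once per legal part.
constexpr unsigned scalableMaskedOpCost(unsigned MinElts, unsigned EltBits) {
  return numLegalParts(MinElts, EltBits) * 2;
}

static_assert(scalableMaskedOpCost(4, 32) == 2, "<vscale x 4 x i32> is legal");
static_assert(scalableMaskedOpCost(4, 64) == 4,
              "<vscale x 4 x i64> splits into two parts");
```

The second assertion reproduces the figure from the review comment: a `<vscale x 4 x i64>` needs two registers, so its cost is 4 rather than 2.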

david-arm added inline comments. Apr 21 2021, 4:55 AM

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1441 ↗	(On Diff #339163)	Sorry I just assumed this was in line with what you were suggesting before about having a default cost model? ARM currently only does this I believe because there was previously no BasicTTIImpl version and I was hoping that the version in BasicTTIImpl would be an improvement on the previous guess of 8 * NumElements. Also, this function never gets called for scalars so it seemed a bit odd to explicitly discriminate between vectors and scalars, and perhaps made more sense to just always call the BaseT version?

Also, why is isLegalMaskedLoad returning true if there are not any legal masked loads for that type?
(Not that adding these extra costs isn't a good thing, it's good to have a better default than just 1, and a good cost is useful in many places).

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1446 ↗	(On Diff #339163)	Br and PHI are often free, but were accounted for here. I think the old code might have been fine, and more accurate for arm.

In D100745#2704767, @dmgreen wrote:

Also, why is isLegalMaskedLoad returning true if there are not any legal masked loads for that type?
(Not that adding these extra costs isn't a good thing, it's good to have a better default than just 1, and a good cost is useful in many places).

This is because as soon as you enable SVE you effectively switch on masked loads and stores. The vectoriser only calls isLegalMaskedLoad with an element type, not a vector type. This means that we can't distinguish between fixed width and scalable vectors.

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1446 ↗	(On Diff #339163)	OK sure, I'll revert it then. I'm not sure the BasicTTIImpl is that accurate for AArch64 either, because we treat branches as zero cost for some reason. Also, probably the i1 vector extract cost is too low as well.

This is because as soon as you enable SVE you effectively switch on masked loads and stores. The vectoriser only calls isLegalMaskedLoad with an element type, not a vector type. This means that we can't distinguish between fixed width and scalable vectors.

OK, that makes sense. It won't know the vector width until later.
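
A toy model of the split described above, where legality is decided per element type and only the cost query sees the vector type. The names and the accepted element widths are assumptions made for illustration; they are not the LLVM interfaces.

```cpp
#include <cassert>

enum class VectorKind { Fixed, Scalable };

// Legality is asked with an element type only, so the answer cannot depend
// on whether the eventual vector is fixed-width or scalable.
// Assumption: any 8/16/32/64-bit element is accepted once SVE is enabled.
constexpr bool isLegalMaskedLoadStore(unsigned EltBits, bool HasSVE) {
  return HasSVE &&
         (EltBits == 8 || EltBits == 16 || EltBits == 32 || EltBits == 64);
}

// The cost query receives the full vector type, so this is where fixed-width
// vectors can be made unattractive (constants from the first revision).
constexpr unsigned maskedOpCost(VectorKind Kind, unsigned NumElts) {
  return Kind == VectorKind::Scalable ? 2 : 8 * NumElts;
}

// Same legality answer for both kinds, very different costs.
static_assert(isLegalMaskedLoadStore(32, true), "");
static_assert(maskedOpCost(VectorKind::Fixed, 4) == 32, "");
static_assert(maskedOpCost(VectorKind::Scalable, 4) == 2, "");
```

This is why the patch steers the vectoriser through cost rather than legality: the legality hook has no way to say "legal for scalable vectors only".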

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1446 ↗	(On Diff #339163)	I think the reasoning is that unconditional branches are often zero cost in modern cpus, in terms of throughput/latency. Conditional branches will depend on the branch predictor, and the number of branches from a scalarized intrinsic can start to break that. It may be worth adding a few llvm.masked.store/llvm.masked.load cost checks for AArch64, if we don't have them already, to show the costs more clearly.

Reverted ARM backend changes.
Added legalisation cost to AArch64 cost function.
Addressed other review comments.

david-arm marked 5 inline comments as done. Apr 22 2021, 9:14 AM

Harbormaster completed remote builds in B100303: Diff 339672. Apr 22 2021, 11:32 AM

Thanks for the updates to the patch, to me this change looks fine now!

This revision was landed with ongoing or failed builds. Apr 26 2021, 3:00 AM

Closed by commit rGa458b7855e1a: [AArch64] Add AArch64TTIImpl::getMaskedMemoryOpCost function (authored by david-arm).

This revision was automatically updated to reflect the committed changes.

david-arm added a commit: rGa458b7855e1a: [AArch64] Add AArch64TTIImpl::getMaskedMemoryOpCost function.

Matt added a subscriber: Matt. May 4 2021, 9:59 AM

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h	4 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp	16 lines
llvm/test/Transforms/LoopVectorize/AArch64/masked-op-cost.ll	92 lines

Diff 338448

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h

 public:
 ...
   Optional<unsigned> getMaxVScale() const {
     if (ST->hasSVE())
       return AArch64::SVEMaxBitsPerVector / AArch64::SVEBitsPerBlock;
     return BaseT::getMaxVScale();
   }

   unsigned getMaxInterleaveFactor(unsigned VF);

+  InstructionCost getMaskedMemoryOpCost(unsigned Opcode, Type *Src,
+                                        Align Alignment, unsigned AddressSpace,
+                                        TTI::TargetCostKind CostKind);
+
   InstructionCost getGatherScatterOpCost(unsigned Opcode, Type *DataTy,
                                          const Value *Ptr, bool VariableMask,
                                          Align Alignment,
                                          TTI::TargetCostKind CostKind,
                                          const Instruction *I = nullptr);

   InstructionCost getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src,
                                    TTI::CastContextHint CCH,
 ...

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

 AArch64TTIImpl::enableMemCmpExpansion(bool OptSize, bool IsZeroCmp) const {
 ...
   Options.NumLoadsPerBlock = Options.MaxNumLoads;
   // TODO: Though vector loads usually perform well on AArch64, in some targets
   // they may wake up the FP unit, which raises the power consumption. Perhaps
   // they could be used with no holds barred (-O3).
   Options.LoadSizes = {8, 4, 2, 1};
   return Options;
 }

+InstructionCost
+AArch64TTIImpl::getMaskedMemoryOpCost(unsigned Opcode, Type *Src, Align Alignment,
+                                      unsigned AddressSpace,
+                                      TTI::TargetCostKind CostKind) {
+  if (!isa<ScalableVectorType>(Src)) {
+    // This involves a comparison, a branch and the load/store itself. The
+    // single scalar cost of 8 is an estimate to reflect the expense of the
+    // branch.
+    unsigned ScalarCost = 8;
+    if (auto *VecTy = dyn_cast<FixedVectorType>(Src))
+      ScalarCost *= VecTy->getNumElements();
+    return ScalarCost;
+  }
+  return 2;
+}
+
 InstructionCost AArch64TTIImpl::getGatherScatterOpCost(
     unsigned Opcode, Type *DataTy, const Value *Ptr, bool VariableMask,
     Align Alignment, TTI::TargetCostKind CostKind, const Instruction *I) {

   if (!isa<ScalableVectorType>(DataTy))
     return BaseT::getGatherScatterOpCost(Opcode, DataTy, Ptr, VariableMask,
                                          Alignment, CostKind, I);
   auto *VT = cast<VectorType>(DataTy);
 ...

llvm/test/Transforms/LoopVectorize/AArch64/masked-op-cost.ll

This file was added.

				; REQUIRES: asserts
				; RUN: opt -loop-vectorize -force-vector-interleave=1 -S -debug < %s 2>%t | FileCheck %s
				; RUN: cat %t | FileCheck %s --check-prefix=CHECK-COST

				target triple = "aarch64-unknown-linux-gnu"

				; CHECK-COST: Checking a loop in "fixed_width"
				; CHECK-COST: Found an estimated cost of 16 for VF 2 For instruction: store i32 2, i32* %arrayidx1, align 4
				; CHECK-COST: Found an estimated cost of 32 for VF 4 For instruction: store i32 2, i32* %arrayidx1, align 4
				; CHECK-COST: Selecting VF: 1.

				; We should decide this loop is not worth vectorising using fixed width vectors
				define void @fixed_width(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i64 %n) #0 {
				; CHECK-LABEL: @fixed_width(
				; CHECK-NOT: vector.body
				entry:
				%cmp6 = icmp sgt i64 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.cond.cleanup.loopexit: ; preds = %for.inc
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit, %entry
				ret void

				for.body: ; preds = %for.body.preheader, %for.inc
				%i.07 = phi i64 [ %inc, %for.inc ], [ 0, %for.body.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %b, i64 %i.07
				%0 = load i32, i32* %arrayidx, align 4
				%tobool.not = icmp eq i32 %0, 0
				br i1 %tobool.not, label %for.inc, label %if.then

				if.then: ; preds = %for.body
				%arrayidx1 = getelementptr inbounds i32, i32* %a, i64 %i.07
				store i32 2, i32* %arrayidx1, align 4
				br label %for.inc

				for.inc: ; preds = %for.body, %if.then
				%inc = add nuw nsw i64 %i.07, 1
				%exitcond.not = icmp eq i64 %inc, %n
				br i1 %exitcond.not, label %for.cond.cleanup.loopexit, label %for.body
				}


				; CHECK-COST: Checking a loop in "scalable"
				; CHECK-COST: Found an estimated cost of 2 for VF vscale x 4 For instruction: store i32 2, i32* %arrayidx1, align 4

				define void @scalable(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i64 %n) #0 {
				; CHECK-LABEL: @scalable(
				; CHECK: vector.body
				; CHECK: call void @llvm.masked.store.nxv4i32.p0nxv4i32
				entry:
				%cmp6 = icmp sgt i64 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.cond.cleanup.loopexit: ; preds = %for.inc
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit, %entry
				ret void

				for.body: ; preds = %for.body.preheader, %for.inc
				%i.07 = phi i64 [ %inc, %for.inc ], [ 0, %for.body.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %b, i64 %i.07
				%0 = load i32, i32* %arrayidx, align 4
				%tobool.not = icmp eq i32 %0, 0
				br i1 %tobool.not, label %for.inc, label %if.then

				if.then: ; preds = %for.body
				%arrayidx1 = getelementptr inbounds i32, i32* %a, i64 %i.07
				store i32 2, i32* %arrayidx1, align 4
				br label %for.inc

				for.inc: ; preds = %for.body, %if.then
				%inc = add nuw nsw i64 %i.07, 1
				%exitcond.not = icmp eq i64 %inc, %n
				br i1 %exitcond.not, label %for.cond.cleanup.loopexit, label %for.body, !llvm.loop !0
				}

				attributes #0 = { "target-features"="+neon,+sve" }

				!0 = distinct !{!0, !1, !2, !3, !4}
				!1 = !{!"llvm.loop.mustprogress"}
				!2 = !{!"llvm.loop.vectorize.width", i32 4}
				!3 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}
				!4 = !{!"llvm.loop.vectorize.enable", i1 true}