This is an archive of the discontinued LLVM Phabricator instance.

Differential D100745

[AArch64] Add AArch64TTIImpl::getMaskedMemoryOpCost function
ClosedPublic

Authored by david-arm on Apr 19 2021, 1:58 AM.

Details

Reviewers

sdesmalen
CarolineConcatto
dmgreen
peterwaller-arm

Commits

rGa458b7855e1a: [AArch64] Add AArch64TTIImpl::getMaskedMemoryOpCost function

Summary

When vectorising for AArch64 targets with the SVE attribute specified,
we automatically treat masked loads and stores as legal. Because there
is no cost model for masked memory ops, we also assume it is cheap to
use the masked load/store intrinsics even for fixed-width vectors. This
can lead to poor code quality, as the intrinsics are currently
scalarised in the backend. This patch adds a basic cost model that
marks fixed-width masked memory ops as significantly more expensive
than their scalable-vector equivalents.

Tests for the cost model are added here:

Transforms/LoopVectorize/AArch64/masked-op-cost.ll

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

david-arm created this revision. Apr 19 2021, 1:58 AM

Herald added subscribers: danielkiss, hiraditya, kristof.beyls. Apr 19 2021, 1:58 AM

david-arm requested review of this revision. Apr 19 2021, 1:58 AM

Herald added a project: Restricted Project. Apr 19 2021, 1:58 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B99434: Diff 338448. Apr 19 2021, 2:39 AM

@david-arm I believe this patch is ok.
If it is not possible to make masked stores illegal for fixed vector using isLegalMaskedLoadStore, then I believe that the cost model is another valid solution to avoid it for fixed vectors.
Thank you for the explanation earlier.
I am going to approve and let's hope that this will not be a curse like in Sander's cost model patch.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1044	Hey David, have you used clang-format or it is Phabricator?

This revision is now accepted and ready to land. Apr 20 2021, 1:44 AM

junparser added a subscriber: junparser. Apr 20 2021, 1:47 AM

junparser added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1050	Would it be better to also consider useSVEForFixedLengthVectors here?

david-arm added inline comments. Apr 20 2021, 1:53 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1050	Yes, definitely, once we have support for lowering masked loads/stores using SVE for fixed-width vectors. At the moment, though, we still scalarise masked loads/stores. I believe there is work in progress on lowering fixed-width masked loads/stores to use SVE, so once that patch lands we can update this cost model too.

sdesmalen added inline comments. Apr 20 2021, 1:53 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1051	Should this actually be something to add to BasicTTIImpl.h so that it can be reused for other targets? The cost of implementing a masked memory op would be:
    NumElts * (cost(load element) + cost(insert element))
    <=> NumElts * cost(load element) + ScalarizationOverhead
Then we only have to implement this function for the scalable case, and all other cases can call the BasicTTIImpl.

david-arm added inline comments. Apr 20 2021, 1:58 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1051	I don't think BasicTTIImpl currently has getMaskedMemoryOpCost, but I can look into adding one if we think it's worthwhile? If so, we'd almost certainly want to update the ARM target too, because it does something very similar. Does the cost you mention above take into account the compare and branch? I'd expect something like:
    %1 = icmp
    br i1 %1, ...
    %2 = load ...
    %3 = insertelement .... %2
I think it's important to reflect the cost of the branch here, since that's something the vector version wouldn't have.
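
The scalarised cost under discussion can be pictured with a small standalone model. This is a minimal sketch, not the BasicTTIImpl code: `ToyCosts` and `scalarizedMaskedLoadCost` are hypothetical names, and the per-operation constants are assumptions (lane insert/extract costing 3 except at lane 0 of a legal part, branches and PHIs costing 0) chosen because they are consistent with the fixed-width CHECK lines in masked_ldst.ll.

```cpp
#include <cassert>

// Hypothetical per-operation costs. These are assumptions for illustration,
// not values read from any LLVM target description.
struct ToyCosts {
  unsigned ScalarLoad = 1;    // one scalar load per element
  unsigned InsertExtract = 3; // lane insert/extract at a non-zero or unknown lane
  unsigned Branch = 0;        // conditional branch, assumed free here
  unsigned Phi = 0;           // PHI to merge the result, assumed free here
};

// NumElts:   number of elements in the fixed-width vector.
// LegalElts: elements per legal register-sized part (equal to NumElts when
//            the type is already legal).
unsigned scalarizedMaskedLoadCost(unsigned NumElts, unsigned LegalElts,
                                  ToyCosts C = ToyCosts()) {
  // Cost of the individual scalar loads.
  unsigned LoadCost = NumElts * C.ScalarLoad;

  // Cost of packing the loaded scalars back into a vector. Inserting into
  // lane 0 of each legal part is assumed to be free.
  unsigned PackingCost = 0;
  for (unsigned I = 0; I < NumElts; ++I)
    if (I % LegalElts != 0)
      PackingCost += C.InsertExtract;

  // Cost of the per-element condition: extract the i1, branch, merge via PHI.
  unsigned ConditionalCost = NumElts * (C.InsertExtract + C.Branch + C.Phi);

  return LoadCost + PackingCost + ConditionalCost;
}
```

Under these assumptions, `scalarizedMaskedLoadCost(4, 4)` evaluates to 4 + 9 + 12 = 25, and a <4 x i64> split into two 2-element parts, `scalarizedMaskedLoadCost(4, 2)`, evaluates to 4 + 6 + 12 = 22. Raising `Branch` above zero shows how quickly the conditional term grows with the element count, which is the concern raised in the comment above.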

Added BasicTTIImpl implementation of getMaskedMemoryOpCost.
Updated ARM and AArch64 backends to use the new BaseT version.

david-arm marked an inline comment as done. Apr 21 2021, 3:20 AM

Harbormaster completed remote builds in B99930: Diff 339163. Apr 21 2021, 3:54 AM

sdesmalen added inline comments. Apr 21 2021, 4:34 AM

llvm/include/llvm/CodeGen/BasicTTIImpl.h
1087–1088	this name is really confusing, because I initially thought it was the same as getMaskedMemoryOpCost. How about `getCommonMaskedMemoryOpCost`? And adding a comment that it is not the same as getMaskedMemoryOpCost. Perhaps you can also make this method `private`, as it's not supposed to be exposed outside this class.
1088	How did this line move here?
1089	Perhaps just personal preference, but I think this reads easier: InstructionCost AddrExtractCost = IsGatherScatter ? getVectorInstrCost(...) : 0;
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1045	A cost of '2' would only be valid for legal types. It probably needs to consider legalisation, so that the cost of a <vscale x 4 x i64> would be 4.
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1441 ↗	(On Diff #339163)	Why have you changed the ARM cost-model?

david-arm added inline comments. Apr 21 2021, 4:55 AM

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1441 ↗	(On Diff #339163)	Sorry I just assumed this was in line with what you were suggesting before about having a default cost model? ARM currently only does this I believe because there was previously no BasicTTIImpl version and I was hoping that the version in BasicTTIImpl would be an improvement on the previous guess of 8 * NumElements. Also, this function never gets called for scalars so it seemed a bit odd to explicitly discriminate between vectors and scalars, and perhaps made more sense to just always call the BaseT version?

Also, why is isLegalMaskedLoad returning true if there are not any legal masked loads for that type?
(Not that adding these extra costs isn't a good thing, it's good to have a better default than just 1, and a good cost is useful in many places).

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1446 ↗	(On Diff #339163)	Br and PHI are often free, but were accounted for here. I think the old code might have been fine, and more accurate for arm.

In D100745#2704767, @dmgreen wrote:

Also, why is isLegalMaskedLoad returning true if there are not any legal masked loads for that type?
(Not that adding these extra costs isn't a good thing, it's good to have a better default than just 1, and a good cost is useful in many places).

This is because as soon as you enable SVE you effectively switch on masked loads and stores. The vectoriser only calls isLegalMaskedLoad with an element type, not a vector type. This means that we can't distinguish between fixed width and scalable vectors.
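
The split between the legality query and the cost query can be sketched as follows. This is a toy model under stated assumptions: `ToyVectorType`, `ToyTarget` and both member functions are hypothetical stand-ins rather than the LLVM interfaces, and the `8 * NumElts` figure is only a placeholder for a scalarisation cost.

```cpp
#include <cassert>

// Hypothetical stand-in for a vector type; not the LLVM class.
struct ToyVectorType {
  unsigned EltBits;    // element width in bits
  unsigned MinNumElts; // (minimum) number of elements
  bool Scalable;       // true for <vscale x N x T>
};

struct ToyTarget {
  bool HasSVE;

  // The legality query only sees the element width, so it gives the same
  // answer for fixed-width and scalable vectors.
  bool isLegalMaskedLoad(unsigned EltBits) const {
    return HasSVE &&
           (EltBits == 8 || EltBits == 16 || EltBits == 32 || EltBits == 64);
  }

  // The cost query sees the whole vector type, so it can tell the two apart.
  unsigned maskedLoadCost(const ToyVectorType &VT) const {
    if (!VT.Scalable)
      return 8 * VT.MinNumElts; // placeholder scalarisation cost (assumption)
    return 2;                   // assumed cheap native predicated load
  }
};
```

With `ToyTarget T{true}`, `T.isLegalMaskedLoad(32)` is true regardless of the vector kind, while `T.maskedLoadCost` returns 32 for a fixed `{32, 4, false}` type and 2 for a scalable `{32, 4, true}` type. The cost model is therefore the place where fixed-width vectors can be steered away from masked operations.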

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1446 ↗	(On Diff #339163)	OK sure, I'll revert it then. I'm not sure the BasicTTIImpl is that accurate for AArch64 either, because we treat branches as zero cost for some reason. Also, probably the i1 vector extract cost is too low as well.

This is because as soon as you enable SVE you effectively switch on masked loads and stores. The vectoriser only calls isLegalMaskedLoad with an element type, not a vector type. This means that we can't distinguish between fixed width and scalable vectors.

OK, that makes sense. It won't know the vector width until later.

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1446 ↗	(On Diff #339163)	I think the reasoning is that unconditional branches are often zero cost in modern cpus, in terms of throughput/latency. Conditional branches will depend on the branch predictor, and the number of branches from a scalarized intrinsic can start to break that. It may be worth adding a few llvm.masked.store/llvm.masked.load cost checks for AArch64, if we don't have them already, to show the costs more clearly.

Reverted ARM backend changes.
Added legalisation cost to AArch64 cost function.
Addressed other review comments.
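
The legalisation cost mentioned in this update can be illustrated with a toy calculation. This is a minimal sketch, assuming a scalable type is split into parts of one 128-bit granule each and that every part costs 2. `scalableMaskedMemOpCost` and `NumParts` are hypothetical names, and `NumParts` only stands in for the value the patch reads from `LT.first`; the real type legalisation handles more cases, such as element promotion.

```cpp
#include <cassert>

// Assumed granule size: one SVE register holds vscale x 128 bits.
constexpr unsigned kSVEBitsPerBlock = 128;

// MinNumElts: the N in <vscale x N x T>.
// EltBits:    width of T in bits.
unsigned scalableMaskedMemOpCost(unsigned MinNumElts, unsigned EltBits) {
  unsigned MinBits = MinNumElts * EltBits;
  // Number of legal register-sized parts, rounded up (stand-in for LT.first).
  unsigned NumParts = (MinBits + kSVEBitsPerBlock - 1) / kSVEBitsPerBlock;
  // Each legal part is charged a flat cost of 2.
  return NumParts * 2;
}
```

For example, `scalableMaskedMemOpCost(2, 64)` models <vscale x 2 x i64> as one part with cost 2, while `scalableMaskedMemOpCost(4, 64)` models <vscale x 4 x i64> as two parts with cost 4, in line with the review comment that a flat cost of 2 is only valid for legal types.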

david-arm marked 5 inline comments as done. Apr 22 2021, 9:14 AM

Harbormaster completed remote builds in B100303: Diff 339672. Apr 22 2021, 11:32 AM

Thanks for the updates to the patch, to me this change looks fine now!

This revision was landed with ongoing or failed builds. Apr 26 2021, 3:00 AM

Closed by commit rGa458b7855e1a: [AArch64] Add AArch64TTIImpl::getMaskedMemoryOpCost function (authored by david-arm).

This revision was automatically updated to reflect the committed changes.

david-arm added a commit: rGa458b7855e1a: [AArch64] Add AArch64TTIImpl::getMaskedMemoryOpCost function.

Matt added a subscriber: Matt. May 4 2021, 9:59 AM

Revision Contents

Path                                                                  Size
llvm/include/llvm/CodeGen/BasicTTIImpl.h                              97 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h                  4 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp                11 lines
llvm/test/Analysis/CostModel/AArch64/masked_ldst.ll                   142 lines
llvm/test/Transforms/LoopVectorize/AArch64/masked-op-cost.ll          92 lines
llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse-mask4.ll    3 lines

Diff 340467

llvm/include/llvm/CodeGen/BasicTTIImpl.h

[183 lines hidden; enclosing context: switch (M) {]
     case TTI::MIM_PostInc:
       return ISD::POST_INC;
     case TTI::MIM_PostDec:
       return ISD::POST_DEC;
     }
     llvm_unreachable("Unexpected MemIndexedMode");
   }

+  InstructionCost getCommonMaskedMemoryOpCost(unsigned Opcode, Type *DataTy,
+                                              Align Alignment,
+                                              bool VariableMask,
+                                              bool IsGatherScatter,
+                                              TTI::TargetCostKind CostKind) {
+    auto *VT = cast<FixedVectorType>(DataTy);
+    // Assume the target does not have support for gather/scatter operations
+    // and provide a rough estimate.
+    //
+    // First, compute the cost of the individual memory operations.
+    InstructionCost AddrExtractCost =
+        IsGatherScatter
+            ? getVectorInstrCost(Instruction::ExtractElement,
+                                 FixedVectorType::get(
+                                     PointerType::get(VT->getElementType(), 0),
+                                     VT->getNumElements()),
+                                 -1)
+            : 0;
+    InstructionCost LoadCost =
+        VT->getNumElements() *
+        (AddrExtractCost +
+         getMemoryOpCost(Opcode, VT->getElementType(), Alignment, 0, CostKind));
+
+    // Next, compute the cost of packing the result in a vector.
+    int PackingCost = getScalarizationOverhead(VT, Opcode != Instruction::Store,
+                                               Opcode == Instruction::Store);
+
+    InstructionCost ConditionalCost = 0;
+    if (VariableMask) {
+      // Compute the cost of conditionally executing the memory operations with
+      // variable masks. This includes extracting the individual conditions, a
+      // branches and PHIs to combine the results.
+      // NOTE: Estimating the cost of conditionally executing the memory
+      // operations accurately is quite difficult and the current solution
+      // provides a very rough estimate only.
+      ConditionalCost =
+          VT->getNumElements() *
+          (getVectorInstrCost(
+               Instruction::ExtractElement,
+               FixedVectorType::get(Type::getInt1Ty(DataTy->getContext()),
+                                    VT->getNumElements()),
+               -1) +
+           getCFInstrCost(Instruction::Br, CostKind) +
+           getCFInstrCost(Instruction::PHI, CostKind));
+    }
+
+    return LoadCost + PackingCost + ConditionalCost;
+  }
+
 protected:
   explicit BasicTTIImplBase(const TargetMachine *TM, const DataLayout &DL)
       : BaseT(DL) {}
   virtual ~BasicTTIImplBase() = default;

   using TargetTransformInfoImplBase::DL;

 public:
[819 lines hidden; enclosing context: if (Src->isVectorTy() &&]
                                         Opcode != Instruction::Store,
                                         Opcode == Instruction::Store);
       }
     }

     return Cost;
   }

+  InstructionCost getMaskedMemoryOpCost(unsigned Opcode, Type *DataTy,
+                                        Align Alignment, unsigned AddressSpace,
+                                        TTI::TargetCostKind CostKind) {
+    return getCommonMaskedMemoryOpCost(Opcode, DataTy, Alignment, true, false,
+                                       CostKind);
+  }
+
   InstructionCost getGatherScatterOpCost(unsigned Opcode, Type *DataTy,
                                          const Value *Ptr, bool VariableMask,
                                          Align Alignment,
                                          TTI::TargetCostKind CostKind,
                                          const Instruction *I = nullptr) {
-    auto *VT = cast<FixedVectorType>(DataTy);
-    // Assume the target does not have support for gather/scatter operations
-    // and provide a rough estimate.
-    //
-    // First, compute the cost of extracting the individual addresses and the
-    // individual memory operations.
-    InstructionCost LoadCost =
-        VT->getNumElements() *
-        (getVectorInstrCost(
-             Instruction::ExtractElement,
-             FixedVectorType::get(PointerType::get(VT->getElementType(), 0),
-                                  VT->getNumElements()),
-             -1) +
-         getMemoryOpCost(Opcode, VT->getElementType(), Alignment, 0, CostKind));
-
-    // Next, compute the cost of packing the result in a vector.
-    int PackingCost = getScalarizationOverhead(VT, Opcode != Instruction::Store,
-                                               Opcode == Instruction::Store);
-
-    InstructionCost ConditionalCost = 0;
-    if (VariableMask) {
-      // Compute the cost of conditionally executing the memory operations with
-      // variable masks. This includes extracting the individual conditions, a
-      // branches and PHIs to combine the results.
-      // NOTE: Estimating the cost of conditionally executing the memory
-      // operations accurately is quite difficult and the current solution
-      // provides a very rough estimate only.
-      ConditionalCost =
-          VT->getNumElements() *
-          (getVectorInstrCost(
-               Instruction::ExtractElement,
-               FixedVectorType::get(Type::getInt1Ty(DataTy->getContext()),
-                                    VT->getNumElements()),
-               -1) +
-           getCFInstrCost(Instruction::Br, CostKind) +
-           getCFInstrCost(Instruction::PHI, CostKind));
-    }
-
-    return LoadCost + PackingCost + ConditionalCost;
+    return getCommonMaskedMemoryOpCost(Opcode, DataTy, Alignment, VariableMask,
+                                       true, CostKind);
   }

   InstructionCost getInterleavedMemoryOpCost(
       unsigned Opcode, Type *VecTy, unsigned Factor, ArrayRef<unsigned> Indices,
       Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,
       bool UseMaskForCond = false, bool UseMaskForGaps = false) {
     auto *VT = cast<FixedVectorType>(VecTy);

[1,039 lines hidden]

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h

[127 lines hidden; enclosing context: public:]
   Optional<unsigned> getMaxVScale() const {
     if (ST->hasSVE())
       return AArch64::SVEMaxBitsPerVector / AArch64::SVEBitsPerBlock;
     return BaseT::getMaxVScale();
   }

   unsigned getMaxInterleaveFactor(unsigned VF);

+  InstructionCost getMaskedMemoryOpCost(unsigned Opcode, Type *Src,
+                                        Align Alignment, unsigned AddressSpace,
+                                        TTI::TargetCostKind CostKind);
+
   InstructionCost getGatherScatterOpCost(unsigned Opcode, Type *DataTy,
                                          const Value *Ptr, bool VariableMask,
                                          Align Alignment,
                                          TTI::TargetCostKind CostKind,
                                          const Instruction *I = nullptr);

   InstructionCost getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src,
                                    TTI::CastContextHint CCH,
[159 lines hidden]

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

[1,032 lines hidden; enclosing context: AArch64TTIImpl::enableMemCmpExpansion(bool OptSize, bool IsZeroCmp) const {]
   Options.NumLoadsPerBlock = Options.MaxNumLoads;
   // TODO: Though vector loads usually perform well on AArch64, in some targets
   // they may wake up the FP unit, which raises the power consumption. Perhaps
   // they could be used with no holds barred (-O3).
   Options.LoadSizes = {8, 4, 2, 1};
   return Options;
 }

+InstructionCost
+AArch64TTIImpl::getMaskedMemoryOpCost(unsigned Opcode, Type *Src,
+                                      Align Alignment, unsigned AddressSpace,
+                                      TTI::TargetCostKind CostKind) {
+  if (!isa<ScalableVectorType>(Src))
+    return BaseT::getMaskedMemoryOpCost(Opcode, Src, Alignment, AddressSpace,
+                                        CostKind);
+  auto LT = TLI->getTypeLegalizationCost(DL, Src);
+  return LT.first * 2;
+}
+
 InstructionCost AArch64TTIImpl::getGatherScatterOpCost(
     unsigned Opcode, Type *DataTy, const Value *Ptr, bool VariableMask,
     Align Alignment, TTI::TargetCostKind CostKind, const Instruction *I) {

   if (!isa<ScalableVectorType>(DataTy))
     return BaseT::getGatherScatterOpCost(Opcode, DataTy, Ptr, VariableMask,
                                          Alignment, CostKind, I);
   auto *VT = cast<VectorType>(DataTy);
[498 lines hidden]

llvm/test/Analysis/CostModel/AArch64/masked_ldst.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py
				; RUN: opt < %s -cost-model -analyze -mtriple=aarch64-linux-gnu -mattr=+sve | FileCheck %s

				define void @fixed() {
				; CHECK-LABEL: 'fixed'
				; CHECK-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %v2i8 = call <2 x i8> @llvm.masked.load.v2i8.p0v2i8(<2 x i8>* undef, i32 8, <2 x i1> undef, <2 x i8> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 25 for instruction: %v4i8 = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* undef, i32 8, <4 x i1> undef, <4 x i8> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 53 for instruction: %v8i8 = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* undef, i32 8, <8 x i1> undef, <8 x i8> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 109 for instruction: %v16i8 = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* undef, i32 8, <16 x i1> undef, <16 x i8> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %v2i16 = call <2 x i16> @llvm.masked.load.v2i16.p0v2i16(<2 x i16>* undef, i32 8, <2 x i1> undef, <2 x i16> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 25 for instruction: %v4i16 = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* undef, i32 8, <4 x i1> undef, <4 x i16> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 53 for instruction: %v8i16 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* undef, i32 8, <8 x i1> undef, <8 x i16> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %v2i32 = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>* undef, i32 8, <2 x i1> undef, <2 x i32> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 25 for instruction: %v4i32 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* undef, i32 8, <4 x i1> undef, <4 x i32> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %v2i64 = call <2 x i64> @llvm.masked.load.v2i64.p0v2i64(<2 x i64>* undef, i32 8, <2 x i1> undef, <2 x i64> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %v2f16 = call <2 x half> @llvm.masked.load.v2f16.p0v2f16(<2 x half>* undef, i32 8, <2 x i1> undef, <2 x half> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 25 for instruction: %v4f16 = call <4 x half> @llvm.masked.load.v4f16.p0v4f16(<4 x half>* undef, i32 8, <4 x i1> undef, <4 x half> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 53 for instruction: %v8f16 = call <8 x half> @llvm.masked.load.v8f16.p0v8f16(<8 x half>* undef, i32 8, <8 x i1> undef, <8 x half> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %v2f32 = call <2 x float> @llvm.masked.load.v2f32.p0v2f32(<2 x float>* undef, i32 8, <2 x i1> undef, <2 x float> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 25 for instruction: %v4f32 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>* undef, i32 8, <4 x i1> undef, <4 x float> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 11 for instruction: %v2f64 = call <2 x double> @llvm.masked.load.v2f64.p0v2f64(<2 x double>* undef, i32 8, <2 x i1> undef, <2 x double> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 22 for instruction: %v4i64 = call <4 x i64> @llvm.masked.load.v4i64.p0v4i64(<4 x i64>* undef, i32 8, <4 x i1> undef, <4 x i64> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 212 for instruction: %v32f16 = call <32 x half> @llvm.masked.load.v32f16.p0v32f16(<32 x half>* undef, i32 8, <32 x i1> undef, <32 x half> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
				;
				entry:
				; Legal fixed-width integer types
				%v2i8 = call <2 x i8> @llvm.masked.load.v2i8.p0v2i8(<2 x i8> *undef, i32 8, <2 x i1> undef, <2 x i8> undef)
				%v4i8 = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8> *undef, i32 8, <4 x i1> undef, <4 x i8> undef)
				%v8i8 = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8> *undef, i32 8, <8 x i1> undef, <8 x i8> undef)
				%v16i8 = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8> *undef, i32 8, <16 x i1> undef, <16 x i8> undef)
				%v2i16 = call <2 x i16> @llvm.masked.load.v2i16.p0v2i16(<2 x i16> *undef, i32 8, <2 x i1> undef, <2 x i16> undef)
				%v4i16 = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16> *undef, i32 8, <4 x i1> undef, <4 x i16> undef)
				%v8i16 = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16> *undef, i32 8, <8 x i1> undef, <8 x i16> undef)
				%v2i32 = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32> *undef, i32 8, <2 x i1> undef, <2 x i32> undef)
				%v4i32 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32> *undef, i32 8, <4 x i1> undef, <4 x i32> undef)
				%v2i64 = call <2 x i64> @llvm.masked.load.v2i64.p0v2i64(<2 x i64> *undef, i32 8, <2 x i1> undef, <2 x i64> undef)

				; Legal fixed-width floating point types
				%v2f16 = call <2 x half> @llvm.masked.load.v2f16.p0v2f16(<2 x half> *undef, i32 8, <2 x i1> undef, <2 x half> undef)
				%v4f16 = call <4 x half> @llvm.masked.load.v4f16.p0v4f16(<4 x half> *undef, i32 8, <4 x i1> undef, <4 x half> undef)
				%v8f16 = call <8 x half> @llvm.masked.load.v8f16.p0v8f16(<8 x half> *undef, i32 8, <8 x i1> undef, <8 x half> undef)
				%v2f32 = call <2 x float> @llvm.masked.load.v2f32.p0v2f32(<2 x float> *undef, i32 8, <2 x i1> undef, <2 x float> undef)
				%v4f32 = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float> *undef, i32 8, <4 x i1> undef, <4 x float> undef)
				%v2f64 = call <2 x double> @llvm.masked.load.v2f64.p0v2f64(<2 x double> *undef, i32 8, <2 x i1> undef, <2 x double> undef)

				; A couple of examples of illegal fixed-width types
				%v4i64 = call <4 x i64> @llvm.masked.load.v4i64.p0v4i64(<4 x i64> *undef, i32 8, <4 x i1> undef, <4 x i64> undef)
				%v32f16 = call <32 x half> @llvm.masked.load.v32f16.p0v32f16(<32 x half> *undef, i32 8, <32 x i1> undef, <32 x half> undef)

				ret void
				}


				define void @scalable() {
				; CHECK-LABEL: 'scalable'
				; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv2i8 = call <vscale x 2 x i8> @llvm.masked.load.nxv2i8.p0nxv2i8(<vscale x 2 x i8>* undef, i32 8, <vscale x 2 x i1> undef, <vscale x 2 x i8> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv4i8 = call <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0nxv4i8(<vscale x 4 x i8>* undef, i32 8, <vscale x 4 x i1> undef, <vscale x 4 x i8> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv8i8 = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0nxv8i8(<vscale x 8 x i8>* undef, i32 8, <vscale x 8 x i1> undef, <vscale x 8 x i8> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv16i8 = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0nxv16i8(<vscale x 16 x i8>* undef, i32 8, <vscale x 16 x i1> undef, <vscale x 16 x i8> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv2i16 = call <vscale x 2 x i16> @llvm.masked.load.nxv2i16.p0nxv2i16(<vscale x 2 x i16>* undef, i32 8, <vscale x 2 x i1> undef, <vscale x 2 x i16> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv4i16 = call <vscale x 4 x i16> @llvm.masked.load.nxv4i16.p0nxv4i16(<vscale x 4 x i16>* undef, i32 8, <vscale x 4 x i1> undef, <vscale x 4 x i16> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv8i16 = call <vscale x 8 x i16> @llvm.masked.load.nxv8i16.p0nxv8i16(<vscale x 8 x i16>* undef, i32 8, <vscale x 8 x i1> undef, <vscale x 8 x i16> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv2i32 = call <vscale x 2 x i32> @llvm.masked.load.nxv2i32.p0nxv2i32(<vscale x 2 x i32>* undef, i32 8, <vscale x 2 x i1> undef, <vscale x 2 x i32> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv4i32 = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* undef, i32 8, <vscale x 4 x i1> undef, <vscale x 4 x i32> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv2i64 = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0nxv2i64(<vscale x 2 x i64>* undef, i32 8, <vscale x 2 x i1> undef, <vscale x 2 x i64> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv2f16 = call <vscale x 2 x half> @llvm.masked.load.nxv2f16.p0nxv2f16(<vscale x 2 x half>* undef, i32 8, <vscale x 2 x i1> undef, <vscale x 2 x half> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv4f16 = call <vscale x 4 x half> @llvm.masked.load.nxv4f16.p0nxv4f16(<vscale x 4 x half>* undef, i32 8, <vscale x 4 x i1> undef, <vscale x 4 x half> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv8f16 = call <vscale x 8 x half> @llvm.masked.load.nxv8f16.p0nxv8f16(<vscale x 8 x half>* undef, i32 8, <vscale x 8 x i1> undef, <vscale x 8 x half> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv2f32 = call <vscale x 2 x float> @llvm.masked.load.nxv2f32.p0nxv2f32(<vscale x 2 x float>* undef, i32 8, <vscale x 2 x i1> undef, <vscale x 2 x float> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv4f32 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* undef, i32 8, <vscale x 4 x i1> undef, <vscale x 4 x float> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %nxv2f64 = call <vscale x 2 x double> @llvm.masked.load.nxv2f64.p0nxv2f64(<vscale x 2 x double>* undef, i32 8, <vscale x 2 x i1> undef, <vscale x 2 x double> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %nxv4i64 = call <vscale x 4 x i64> @llvm.masked.load.nxv4i64.p0nxv4i64(<vscale x 4 x i64>* undef, i32 8, <vscale x 4 x i1> undef, <vscale x 4 x i64> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 8 for instruction: %nxv32f16 = call <vscale x 32 x half> @llvm.masked.load.nxv32f16.p0nxv32f16(<vscale x 32 x half>* undef, i32 8, <vscale x 32 x i1> undef, <vscale x 32 x half> undef)
				; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
				;
				entry:
				; Legal scalable integer types
				%nxv2i8 = call <vscale x 2 x i8> @llvm.masked.load.nxv2i8.p0nxv2i8(<vscale x 2 x i8> *undef, i32 8, <vscale x 2 x i1> undef, <vscale x 2 x i8> undef)
				%nxv4i8 = call <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0nxv4i8(<vscale x 4 x i8> *undef, i32 8, <vscale x 4 x i1> undef, <vscale x 4 x i8> undef)
				%nxv8i8 = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0nxv8i8(<vscale x 8 x i8> *undef, i32 8, <vscale x 8 x i1> undef, <vscale x 8 x i8> undef)
				%nxv16i8 = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0nxv16i8(<vscale x 16 x i8> *undef, i32 8, <vscale x 16 x i1> undef, <vscale x 16 x i8> undef)
				%nxv2i16 = call <vscale x 2 x i16> @llvm.masked.load.nxv2i16.p0nxv2i16(<vscale x 2 x i16> *undef, i32 8, <vscale x 2 x i1> undef, <vscale x 2 x i16> undef)
				%nxv4i16 = call <vscale x 4 x i16> @llvm.masked.load.nxv4i16.p0nxv4i16(<vscale x 4 x i16> *undef, i32 8, <vscale x 4 x i1> undef, <vscale x 4 x i16> undef)
				%nxv8i16 = call <vscale x 8 x i16> @llvm.masked.load.nxv8i16.p0nxv8i16(<vscale x 8 x i16> *undef, i32 8, <vscale x 8 x i1> undef, <vscale x 8 x i16> undef)
				%nxv2i32 = call <vscale x 2 x i32> @llvm.masked.load.nxv2i32.p0nxv2i32(<vscale x 2 x i32> *undef, i32 8, <vscale x 2 x i1> undef, <vscale x 2 x i32> undef)
				%nxv4i32 = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32> *undef, i32 8, <vscale x 4 x i1> undef, <vscale x 4 x i32> undef)
				%nxv2i64 = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0nxv2i64(<vscale x 2 x i64> *undef, i32 8, <vscale x 2 x i1> undef, <vscale x 2 x i64> undef)

				; Legal scalable floating point types
				%nxv2f16 = call <vscale x 2 x half> @llvm.masked.load.nxv2f16.p0nxv2f16(<vscale x 2 x half> *undef, i32 8, <vscale x 2 x i1> undef, <vscale x 2 x half> undef)
				%nxv4f16 = call <vscale x 4 x half> @llvm.masked.load.nxv4f16.p0nxv4f16(<vscale x 4 x half> *undef, i32 8, <vscale x 4 x i1> undef, <vscale x 4 x half> undef)
				%nxv8f16 = call <vscale x 8 x half> @llvm.masked.load.nxv8f16.p0nxv8f16(<vscale x 8 x half> *undef, i32 8, <vscale x 8 x i1> undef, <vscale x 8 x half> undef)
				%nxv2f32 = call <vscale x 2 x float> @llvm.masked.load.nxv2f32.p0nxv2f32(<vscale x 2 x float> *undef, i32 8, <vscale x 2 x i1> undef, <vscale x 2 x float> undef)
				%nxv4f32 = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float> *undef, i32 8, <vscale x 4 x i1> undef, <vscale x 4 x float> undef)
				%nxv2f64 = call <vscale x 2 x double> @llvm.masked.load.nxv2f64.p0nxv2f64(<vscale x 2 x double> *undef, i32 8, <vscale x 2 x i1> undef, <vscale x 2 x double> undef)

				; A couple of examples of illegal scalable types
				%nxv4i64 = call <vscale x 4 x i64> @llvm.masked.load.nxv4i64.p0nxv4i64(<vscale x 4 x i64> *undef, i32 8, <vscale x 4 x i1> undef, <vscale x 4 x i64> undef)
				%nxv32f16 = call <vscale x 32 x half> @llvm.masked.load.nxv32f16.p0nxv32f16(<vscale x 32 x half> *undef, i32 8, <vscale x 32 x i1> undef, <vscale x 32 x half> undef)

				ret void
				}

				declare <2 x i8> @llvm.masked.load.v2i8.p0v2i8(<2 x i8>*, i32, <2 x i1>, <2 x i8>)
				declare <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>*, i32, <4 x i1>, <4 x i8>)
				declare <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>*, i32, <8 x i1>, <8 x i8>)
				declare <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>*, i32, <16 x i1>, <16 x i8>)
				declare <2 x i16> @llvm.masked.load.v2i16.p0v2i16(<2 x i16>*, i32, <2 x i1>, <2 x i16>)
				declare <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>*, i32, <4 x i1>, <4 x i16>)
				declare <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>*, i32, <8 x i1>, <8 x i16>)
				declare <2 x i32> @llvm.masked.load.v2i32.p0v2i32(<2 x i32>*, i32, <2 x i1>, <2 x i32>)
				declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32, <4 x i1>, <4 x i32>)
				declare <2 x i64> @llvm.masked.load.v2i64.p0v2i64(<2 x i64>*, i32, <2 x i1>, <2 x i64>)
				declare <4 x i64> @llvm.masked.load.v4i64.p0v4i64(<4 x i64>*, i32, <4 x i1>, <4 x i64>)
				declare <2 x half> @llvm.masked.load.v2f16.p0v2f16(<2 x half>*, i32, <2 x i1>, <2 x half>)
				declare <4 x half> @llvm.masked.load.v4f16.p0v4f16(<4 x half>*, i32, <4 x i1>, <4 x half>)
				declare <8 x half> @llvm.masked.load.v8f16.p0v8f16(<8 x half>*, i32, <8 x i1>, <8 x half>)
				declare <32 x half> @llvm.masked.load.v32f16.p0v32f16(<32 x half>*, i32, <32 x i1>, <32 x half>)
				declare <2 x float> @llvm.masked.load.v2f32.p0v2f32(<2 x float>*, i32, <2 x i1>, <2 x float>)
				declare <4 x float> @llvm.masked.load.v4f32.p0v4f32(<4 x float>*, i32, <4 x i1>, <4 x float>)
				declare <2 x double> @llvm.masked.load.v2f64.p0v2f64(<2 x double>*, i32, <2 x i1>, <2 x double>)


				declare <vscale x 2 x i8> @llvm.masked.load.nxv2i8.p0nxv2i8(<vscale x 2 x i8>*, i32, <vscale x 2 x i1>, <vscale x 2 x i8>)
				declare <vscale x 4 x i8> @llvm.masked.load.nxv4i8.p0nxv4i8(<vscale x 4 x i8>*, i32, <vscale x 4 x i1>, <vscale x 4 x i8>)
				declare <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0nxv8i8(<vscale x 8 x i8>*, i32, <vscale x 8 x i1>, <vscale x 8 x i8>)
				declare <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0nxv16i8(<vscale x 16 x i8>*, i32, <vscale x 16 x i1>, <vscale x 16 x i8>)
				declare <vscale x 2 x i16> @llvm.masked.load.nxv2i16.p0nxv2i16(<vscale x 2 x i16>*, i32, <vscale x 2 x i1>, <vscale x 2 x i16>)
				declare <vscale x 4 x i16> @llvm.masked.load.nxv4i16.p0nxv4i16(<vscale x 4 x i16>*, i32, <vscale x 4 x i1>, <vscale x 4 x i16>)
				declare <vscale x 8 x i16> @llvm.masked.load.nxv8i16.p0nxv8i16(<vscale x 8 x i16>*, i32, <vscale x 8 x i1>, <vscale x 8 x i16>)
				declare <vscale x 2 x i32> @llvm.masked.load.nxv2i32.p0nxv2i32(<vscale x 2 x i32>*, i32, <vscale x 2 x i1>, <vscale x 2 x i32>)
				declare <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>*, i32, <vscale x 4 x i1>, <vscale x 4 x i32>)
				declare <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0nxv2i64(<vscale x 2 x i64>*, i32, <vscale x 2 x i1>, <vscale x 2 x i64>)
				declare <vscale x 4 x i64> @llvm.masked.load.nxv4i64.p0nxv4i64(<vscale x 4 x i64>*, i32, <vscale x 4 x i1>, <vscale x 4 x i64>)
				declare <vscale x 2 x half> @llvm.masked.load.nxv2f16.p0nxv2f16(<vscale x 2 x half>*, i32, <vscale x 2 x i1>, <vscale x 2 x half>)
				declare <vscale x 4 x half> @llvm.masked.load.nxv4f16.p0nxv4f16(<vscale x 4 x half>*, i32, <vscale x 4 x i1>, <vscale x 4 x half>)
				declare <vscale x 8 x half> @llvm.masked.load.nxv8f16.p0nxv8f16(<vscale x 8 x half>*, i32, <vscale x 8 x i1>, <vscale x 8 x half>)
				declare <vscale x 32 x half> @llvm.masked.load.nxv32f16.p0nxv32f16(<vscale x 32 x half>*, i32, <vscale x 32 x i1>, <vscale x 32 x half>)
				declare <vscale x 2 x float> @llvm.masked.load.nxv2f32.p0nxv2f32(<vscale x 2 x float>*, i32, <vscale x 2 x i1>, <vscale x 2 x float>)
				declare <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>*, i32, <vscale x 4 x i1>, <vscale x 4 x float>)
				declare <vscale x 2 x double> @llvm.masked.load.nxv2f64.p0nxv2f64(<vscale x 2 x double>*, i32, <vscale x 2 x i1>, <vscale x 2 x double>)
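
The scalable-vector costs checked above follow a simple pattern: every legal type costs 2, `<vscale x 4 x i64>` costs 4, and `<vscale x 32 x half>` costs 8. The sketch below is a toy model, not the code in AArch64TargetTransformInfo.cpp. It assumes the cost is twice the number of 128-bit SVE register granules the type is split into, which reproduces these numbers. The names `ScalableVecType`, `numLegalParts` and `maskedLoadCostSketch` are hypothetical.

```cpp
#include <cstdint>

// Hypothetical description of a scalable vector type <vscale x N x T>.
struct ScalableVecType {
  uint32_t MinElements; // N in <vscale x N x T>
  uint32_t ElementBits; // bit width of T
};

// Assumption: a type is split into 128-bit granules during legalisation,
// and a type narrower than one granule still occupies one register.
constexpr uint32_t numLegalParts(ScalableVecType Ty) {
  return (Ty.MinElements * Ty.ElementBits + 127) / 128 == 0
             ? 1
             : (Ty.MinElements * Ty.ElementBits + 127) / 128;
}

// Assumption: each legal part costs 2, matching the CHECK lines in the test.
constexpr uint32_t maskedLoadCostSketch(ScalableVecType Ty) {
  return 2 * numLegalParts(Ty);
}

// Costs taken from the CHECK lines of masked_ldst.ll.
static_assert(maskedLoadCostSketch(ScalableVecType{2, 64}) == 2, "nxv2i64");
static_assert(maskedLoadCostSketch(ScalableVecType{8, 16}) == 2, "nxv8f16");
static_assert(maskedLoadCostSketch(ScalableVecType{4, 64}) == 4, "nxv4i64");
static_assert(maskedLoadCostSketch(ScalableVecType{32, 16}) == 8, "nxv32f16");
```

Under this model the two "illegal" types are simply the ones that need more than one register, so their cost grows with the number of parts.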

llvm/test/Transforms/LoopVectorize/AArch64/masked-op-cost.ll

This file was added.

				; REQUIRES: asserts
				; RUN: opt -loop-vectorize -force-vector-interleave=1 -S -debug < %s 2>%t | FileCheck %s
				; RUN: cat %t | FileCheck %s --check-prefix=CHECK-COST

				target triple = "aarch64-unknown-linux-gnu"

				; CHECK-COST: Checking a loop in "fixed_width"
				; CHECK-COST: Found an estimated cost of 11 for VF 2 For instruction: store i32 2, i32* %arrayidx1, align 4
				; CHECK-COST: Found an estimated cost of 25 for VF 4 For instruction: store i32 2, i32* %arrayidx1, align 4
				; CHECK-COST: Selecting VF: 1.

				; We should decide this loop is not worth vectorising using fixed width vectors
				define void @fixed_width(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i64 %n) #0 {
				; CHECK-LABEL: @fixed_width(
				; CHECK-NOT: vector.body
				entry:
				%cmp6 = icmp sgt i64 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.cond.cleanup.loopexit: ; preds = %for.inc
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit, %entry
				ret void

				for.body: ; preds = %for.body.preheader, %for.inc
				%i.07 = phi i64 [ %inc, %for.inc ], [ 0, %for.body.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %b, i64 %i.07
				%0 = load i32, i32* %arrayidx, align 4
				%tobool.not = icmp eq i32 %0, 0
				br i1 %tobool.not, label %for.inc, label %if.then

				if.then: ; preds = %for.body
				%arrayidx1 = getelementptr inbounds i32, i32* %a, i64 %i.07
				store i32 2, i32* %arrayidx1, align 4
				br label %for.inc

				for.inc: ; preds = %for.body, %if.then
				%inc = add nuw nsw i64 %i.07, 1
				%exitcond.not = icmp eq i64 %inc, %n
				br i1 %exitcond.not, label %for.cond.cleanup.loopexit, label %for.body
				}


				; CHECK-COST: Checking a loop in "scalable"
				; CHECK-COST: Found an estimated cost of 2 for VF vscale x 4 For instruction: store i32 2, i32* %arrayidx1, align 4

				define void @scalable(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i64 %n) #0 {
				; CHECK-LABEL: @scalable(
				; CHECK: vector.body
				; CHECK: call void @llvm.masked.store.nxv4i32.p0nxv4i32
				entry:
				%cmp6 = icmp sgt i64 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader: ; preds = %entry
				br label %for.body

				for.cond.cleanup.loopexit: ; preds = %for.inc
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit, %entry
				ret void

				for.body: ; preds = %for.body.preheader, %for.inc
				%i.07 = phi i64 [ %inc, %for.inc ], [ 0, %for.body.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %b, i64 %i.07
				%0 = load i32, i32* %arrayidx, align 4
				%tobool.not = icmp eq i32 %0, 0
				br i1 %tobool.not, label %for.inc, label %if.then

				if.then: ; preds = %for.body
				%arrayidx1 = getelementptr inbounds i32, i32* %a, i64 %i.07
				store i32 2, i32* %arrayidx1, align 4
				br label %for.inc

				for.inc: ; preds = %for.body, %if.then
				%inc = add nuw nsw i64 %i.07, 1
				%exitcond.not = icmp eq i64 %inc, %n
				br i1 %exitcond.not, label %for.cond.cleanup.loopexit, label %for.body, !llvm.loop !0
				}

				attributes #0 = { "target-features"="+neon,+sve" }

				!0 = distinct !{!0, !1, !2, !3, !4}
				!1 = !{!"llvm.loop.mustprogress"}
				!2 = !{!"llvm.loop.vectorize.width", i32 4}
				!3 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}
				!4 = !{!"llvm.loop.vectorize.enable", i1 true}
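
Both `@fixed_width` and `@scalable` contain the same loop body; only the loop metadata differs. A C++ source loop that would plausibly lower to this IR is sketched below. The function name `conditionalStore` is hypothetical, and `__restrict` is a compiler extension (supported by Clang and GCC) used here to mirror the `noalias` attributes.

```cpp
#include <cstdint>

// Store 2 to a[i] only where b[i] is non-zero. Because the store is
// conditional, a vectorised version needs a masked store (or scalarisation).
void conditionalStore(int32_t *__restrict a, const int32_t *__restrict b,
                      int64_t n) {
  for (int64_t i = 0; i < n; ++i) {
    if (b[i] != 0)
      a[i] = 2;
  }
}
```

For the fixed-width case, the reported store costs of 11 at VF 2 and 25 at VF 4 work out to 5.5 and 6.25 per lane, which is consistent with the `Selecting VF: 1` line. The scalable case reports a cost of 2 for a whole `vscale x 4` vector and is vectorised with `llvm.masked.store`.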

llvm/test/Transforms/LoopVectorize/AArch64/vector-reverse-mask4.ll

 for.inc: ; preds = %for.body, %if.then
 %cmp = icmp sgt i64 %i.08.in, 1
 br i1 %cmp, label %for.body, label %for.cond.cleanup, !llvm.loop !0
 }

 attributes #0 = {"target-cpu"="generic" "target-features"="+neon,+sve"}


-!0 = distinct !{!0, !1, !2, !3, !4}
+!0 = distinct !{!0, !1, !2, !3, !4, !5}
 !1 = !{!"llvm.loop.mustprogress"}
 !2 = !{!"llvm.loop.vectorize.width", i32 4}
 !3 = !{!"llvm.loop.vectorize.scalable.enable", i1 false}
 !4 = !{!"llvm.loop.vectorize.enable", i1 true}
+!5 = !{!"llvm.loop.interleave.count", i32 2}

Revision Contents

Diff 340467
