This is an archive of the discontinued LLVM Phabricator instance.

[LV] Use VScaleForTuning to fine-tune the cost per lane.
ClosedPublic

Authored by sdesmalen on Nov 4 2021, 11:23 AM.

Details

Summary

When targeting a specific CPU with scalable vectorization, the knowledge
of that particular CPU's vscale value can be used to tune the cost-model
and make the cost per lane less pessimistic.

If the target implements 'TTI.getVScaleForTuning()', the cost-per-lane
is calculated as:

Cost / (VScaleForTuning * VF.KnownMinLanes)

Otherwise, a value of 1 is assumed, meaning that the behavior is
unchanged and the cost is calculated as:

Cost / VF.KnownMinLanes
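
For illustration, a minimal sketch of the computation described above, written against the Optional-based API used at the time of this patch; 'CostValue' (the already-computed cost for the whole VF, as a plain unsigned) is an assumed name, and the exact code in LoopVectorize.cpp may differ:

unsigned EstimatedWidth = VF.getKnownMinValue();
if (VF.isScalable())
  if (Optional<unsigned> VScale = TTI.getVScaleForTuning())
    EstimatedWidth *= VScale.getValue();
// Without a tuning value, EstimatedWidth stays at VF.KnownMinLanes,
// so the behavior is unchanged.
float CostPerLane = float(CostValue) / EstimatedWidth;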

Diff Detail

Event Timeline

sdesmalen created this revision. · Nov 4 2021, 11:23 AM
sdesmalen requested review of this revision. · Nov 4 2021, 11:23 AM
Herald added a project: Restricted Project. · Nov 4 2021, 11:23 AM
kmclaughlin accepted this revision. · Nov 5 2021, 3:50 AM

Thanks @sdesmalen, this LGTM!

This revision is now accepted and ready to land. · Nov 5 2021, 3:50 AM

How does this interact with vscale_range? Could it perhaps automatically infer getVScaleForTuning using that? Or is the idea the target ultimately chooses?

The two are actually quite different: vscale_range specifies the range of vscale values that the compiled binary is compatible with, and LLVM guarantees that the binary is correct for that range. VScaleForTuning can be set separately via -mcpu/-mtune and purely tunes the cost-model without changing the requirements on vscale. This means it doesn't change the compatibility of the binary; it just helps choose a better VF for the CPU being compiled for.

Fair enough, thanks for the explanation!
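
To make the distinction above concrete, a target could report its tuning value through TTI roughly as in the sketch below; MyTargetTTIImpl, ST->supportsScalableVectors() and ST->getTuningVScale() are hypothetical names used for illustration, not the actual AArch64 implementation:

Optional<unsigned> MyTargetTTIImpl::getVScaleForTuning() const {
  // Purely a cost-model hint (e.g. 2 for a 256-bit implementation);
  // unlike the vscale_range attribute, it places no constraint on the
  // vscale values the generated binary may run with.
  if (ST->supportsScalableVectors())
    return ST->getTuningVScale();
  return None;
}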

david-arm added inline comments. · Nov 5 2021, 7:35 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6009

nit: Could you remove the unnecessary whitespace before merging? Thanks!

6025

nit: This is just a suggestion, but it feels a little odd to be rescaling the ElementCount by vscale, because vscale is already implicit in a scalable element count. For example, what we're doing here is taking a <vscale x 4> ElementCount and multiplying it by another vscale, so we effectively end up with an ElementCount like <vscale x vscale x 4>. Is it worth just using unsigned values instead?

unsigned EstimatedWidthA = A.Width.getKnownMinValue();
unsigned EstimatedWidthB = B.Width.getKnownMinValue();
if (Optional<unsigned> VScale = TTI.getVScaleForTuning()) {
  if (A.Width.isScalable())
    EstimatedWidthA *= VScale.getValue();
  if (B.Width.isScalable())
    EstimatedWidthB *= VScale.getValue();
}
...
sdesmalen added inline comments. · Nov 5 2021, 8:15 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6025

I was a bit worried it would get confusing to use A.Width.isScalable() in combination with EstimatedWidthA, but perhaps splitting it up is clearer, as you suggest.

sdesmalen updated this revision to Diff 385079. · Nov 5 2021, 8:15 AM

Use unsigned instead of ElementCount for 'EstimatedWidth'.

david-arm accepted this revision. · Nov 8 2021, 3:20 AM

LGTM! Thanks for making the changes @sdesmalen. :partyparrot

This revision was automatically updated to reflect the committed changes.