The cost model for gathers, scatters, and ordered reductions is based on
a pessimistic algorithm that scalarises the operations, using the
architectural maximum value of vscale to determine the worst-case
number of elements. However, the maximum vector length in currently
available hardware is 512 bits, so I've modified
AArch64TargetTransform::getMaxNumElements
to allow callers to set the worst-case vscale value to something more
pragmatic.
In the long term we want to come up with a more realistic cost model
that reflects the fact that these operations are unlikely to be
completely scalarised. For now, however, this minor tweak permits many
more loops to be vectorised using scalable vectors.
In practice, Clang now adds the vscale_range attribute to all functions, so this change is artificial in that it only changes the cost for the unit tests (which don't specify vscale_range).
I think what you're actually after is a function that returns the median value between min and max, e.g. for vscale_range(0, 16) it chooses 8, and for vscale_range(8, 8) it also chooses 8. It would be better if we don't start tuning for specific bit-widths based on the implementations that are available today.