This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][CostModel] Improve cost for fsqrt intrinsics.
AbandonedPublic

Authored by mcrosier on Jan 3 2017, 2:08 PM.

Details

Reviewers

RKSimon
MatzeB
mkuper
mssimpso

Summary

This patch changes the cost of sqrt intrinsics for AArch64.

I have very limited knowledge of the cost model, so I tried to pick fairly conservative values as a starting point. In looking at the ongoing work on the X86 side it appears these values reflect the latency of the instruction. However, a discussion with @mssimpso led me to believe the AArch64 cost model doesn't directly use the instruction latencies.

Any input here would be greatly appreciated.

This change causes a handful of additional cases in SPEC (e.g., povray) to be SLP vectorized. In fact, this may only change codegen when targeting Kryo, where insert and extract element operations are cheaper than on most other sub-targets.

Chad

Diff Detail

Event Timeline

mcrosier updated this revision to Diff 82945. Jan 3 2017, 2:08 PM

mcrosier retitled this revision to [AArch64][CostModel] Improve cost for fsqrt intrinsics.

mcrosier updated this object.

mcrosier added reviewers: mkuper, RKSimon, mssimpso, MatzeB.

mcrosier added subscribers: llvm-commits, mssimpso.

Herald added subscribers: rengolin, aemerson. Jan 3 2017, 2:08 PM

I have very limited knowledge of the cost model, so I tried to pick fairly conservative values as a starting point. In looking at the ongoing work on the X86 side it appears these values reflect the latency of the instruction. However, a discussion with @mssimpso led me to believe the AArch64 cost model doesn't directly use the instruction latencies.

At the TTI level the costs should represent throughput, as that is more useful for determining the benefit of vectorization; for fsqrt/fdiv units this is often equal (or almost equal) to the latency. Additionally, it might get more complicated if the sqrt/div unit is only a 64-bit ALU and you're processing 128-bit vectors, but from your example costs it doesn't look like it.

lib/Target/AArch64/AArch64TargetTransformInfo.cpp
504	I don't know the range of costs that AARCH64 cores can have here - for x86 we tend to qualify these by mentioning the core type that we used for the costs in a comment. But AARCH64 is younger so might still be more consistent!
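The throughput and unit-width remark above can be sketched as a small standalone model. This is only an illustration under stated assumptions, not code from the patch: the function names, the idea of counting "passes" through the sqrt unit, and all parameters are hypothetical. Only the notion of a legalization split count corresponds to something in the diff (LT.first).

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: number of legal-width pieces a vector is split into by
// type legalization (the quantity the patch reads from LT.first).
int numLegalParts(int VectorBits, int LegalBits) {
  assert(VectorBits > 0 && LegalBits > 0);
  return (VectorBits + LegalBits - 1) / LegalBits;
}

// Assumed model: each legal piece needs ceil(PieceBits / UnitBits) passes
// through the sqrt/div unit, and each pass costs CostPerPass (a throughput
// figure, not a latency).
int throughputCost(int VectorBits, int LegalBits, int UnitBits,
                   int CostPerPass) {
  assert(UnitBits > 0);
  int Parts = numLegalParts(VectorBits, LegalBits);
  int PieceBits = std::min(VectorBits, LegalBits);
  int PassesPerPiece = (PieceBits + UnitBits - 1) / UnitBits;
  return Parts * PassesPerPiece * CostPerPass;
}
```

Under this model a 128-bit vector on a full-width unit costs one pass, while the same vector on a 64-bit unit costs two, which is the complication mentioned in the comment.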

fhahn edited edge metadata. Jan 4 2017, 4:16 AM

fhahn added a subscriber: fhahn.

Thanks for the feedback, Simon. I'll experiment with using costs that are more representative of throughput as you suggest.

lib/Target/AArch64/AArch64TargetTransformInfo.cpp
504	I think this makes sense, but the AArch64 backend has only a single generation of SIMD instructions (excluding the v8.[1|2]a extensions and the recently announced Scalable Vector Extension (SVE)). We might consider predicating the logic based on the specific sub-target (e.g., Kryo, Cortex-A57) as opposed to the SIMD generation (e.g., SSE, MMX, AVX2) for AArch64, at least in those cases where the latencies vary greatly between subtargets.
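A minimal sketch of what predicating the cost on the sub-target could look like. Everything here is hypothetical: the SubTarget and FPType enums, the sqrtCost function, and the CoreA/CoreB rows, whose numbers are placeholders and not measurements of any real core. Only the generic fallback values (4 for single precision, 5 for double precision) are taken from the table in this diff.

```cpp
// Hypothetical sub-target tags; CoreA and CoreB stand in for real cores such
// as the ones named in the comment above.
enum class SubTarget { Generic, CoreA, CoreB };
enum class FPType { F32, F64 };

// Returns an illustrative sqrt cost for the given sub-target and element type.
int sqrtCost(SubTarget ST, FPType Ty) {
  switch (ST) {
  case SubTarget::CoreA:
    return Ty == FPType::F32 ? 3 : 6;  // placeholder values, not measured
  case SubTarget::CoreB:
    return Ty == FPType::F32 ? 8 : 16; // placeholder values, not measured
  case SubTarget::Generic:
    break;
  }
  // Generic fallback: the values used by the cost table in this diff.
  return Ty == FPType::F32 ? 4 : 5;
}
```

The design point is only that the per-core override is consulted first and the shared table remains the default, so cores with similar latencies need no special handling.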

RKSimon added inline comments. Jan 6 2017, 5:39 AM

test/Transforms/SLPVectorizer/AArch64/intrinsic-cost-model.ll
1	Not sure if you're interested but utils\update_test_checks.py could be used to auto-generate this. Also, why call it intrinsic-cost-model.ll and not just sqrt.ll or intrinsics.ll ?
20	Since this is a SLPVectorizer test are you gaining anything by giving it a loop to vectorize instead of (simpler) flat code?
44	Add a float test?

-Address a few of Simon's comments.
-Specialize the cost specifically for Kryo latencies, which are fairly different from the other targets I've looked at.

I'm going to hand this work off to @bmakam, so I'm blocking this until he has had time to review the current state and decide if this patch is reasonable.

mcrosier abandoned this revision. Jan 24 2017, 9:49 AM

Revision Contents

Path	Size
lib/Target/AArch64/AArch64TargetTransformInfo.h	5 lines
lib/Target/AArch64/AArch64TargetTransformInfo.cpp	34 lines
test/Analysis/CostModel/AArch64/arith-fp.ll	27 lines
test/Transforms/SLPVectorizer/AArch64/intrinsic-cost-model.ll	45 lines

Diff 82945

lib/Target/AArch64/AArch64TargetTransformInfo.h

Show First 20 Lines • Show All 105 Lines • ▼ Show 20 Lines	public:

  int getAddressComputationCost(Type *Ty, bool IsComplex);

  int getCmpSelInstrCost(unsigned Opcode, Type *ValTy, Type *CondTy);

  int getMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,
                      unsigned AddressSpace);

+ int getIntrinsicInstrCost(Intrinsic::ID IID, Type *RetTy,
+                           ArrayRef<Type *> Tys, FastMathFlags FMF);
+ int getIntrinsicInstrCost(Intrinsic::ID IID, Type *RetTy,
+                           ArrayRef<Value *> Args, FastMathFlags FMF);
+
  int getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys);

  void getUnrollingPreferences(Loop *L, TTI::UnrollingPreferences &UP);

  Value *getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
                                           Type *ExpectedType);

  bool getTgtMemIntrinsic(IntrinsicInst *Inst, MemIntrinsicInfo &Info);
Show All 18 Lines

lib/Target/AArch64/AArch64TargetTransformInfo.cpp

Show First 20 Lines • Show All 487 Lines • ▼ Show 20 Lines	if (Src->isVectorTy() && Src->getVectorElementType()->isIntegerTy(8) &&
      unsigned NumVectorizableInstsToAmortize = NumVecElts * 2;
      // We generate 2 instructions per vector element.
      return NumVectorizableInstsToAmortize * NumVecElts * 2;
    }

    return LT.first;
  }

+ int AArch64TTIImpl::getIntrinsicInstrCost(Intrinsic::ID IID, Type *RetTy,
+                                           ArrayRef<Type *> Tys, FastMathFlags FMF) {
+   static const CostTblEntry IntrinsicCostTbl[] = {
+     { ISD::FSQRT, MVT::f32, 4 },
+     { ISD::FSQRT, MVT::v2f32, 4 },
+     { ISD::FSQRT, MVT::v4f32, 4 },
+     { ISD::FSQRT, MVT::f64, 5 },
+     { ISD::FSQRT, MVT::v2f64, 5 },
+   };
+
+   unsigned ISD = ISD::DELETED_NODE;
+   switch (IID) {
+   default:
+     break;
+   case Intrinsic::sqrt:
+     ISD = ISD::FSQRT;
+     break;
+   }
+
+   // Legalize the type.
+   std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, RetTy);
+   MVT MTy = LT.second;
+
+   // Attempt to lookup cost.
+   if (const auto *Entry = CostTableLookup(IntrinsicCostTbl, ISD, MTy))
+     return LT.first * Entry->Cost;
+   return BaseT::getIntrinsicInstrCost(IID, RetTy, Tys, FMF);
+ }
+
+ int AArch64TTIImpl::getIntrinsicInstrCost(Intrinsic::ID IID, Type *RetTy,
+                                           ArrayRef<Value *> Args, FastMathFlags FMF) {
+   return BaseT::getIntrinsicInstrCost(IID, RetTy, Args, FMF);
+ }
+
  int AArch64TTIImpl::getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy,
                                                 unsigned Factor,
                                                 ArrayRef<unsigned> Indices,
                                                 unsigned Alignment,
                                                 unsigned AddressSpace) {
    assert(Factor >= 2 && "Invalid interleave factor");
    assert(isa<VectorType>(VecTy) && "Expect a vector type");

▲ Show 20 Lines • Show All 140 Lines • Show Last 20 Lines
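
The table lookup in getIntrinsicInstrCost above can be approximated outside of LLVM as follows. This is a standalone sketch, not the LLVM implementation: Opcode, SimpleVT, CostEntry, lookup, and intrinsicCost are hypothetical stand-ins for ISD opcodes, MVT, CostTblEntry, CostTableLookup, and the TTI hook. The Fallback parameter stands in for the BaseT call, and the table rows are copied from the diff.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the ISD opcode and MVT enums used by the patch.
enum class Opcode { None, FSqrt };
enum class SimpleVT { f32, v2f32, v4f32, f64, v2f64, Other };

struct CostEntry {
  Opcode Op;
  SimpleVT Type;
  int Cost;
};

// Same rows as IntrinsicCostTbl in the diff.
static const CostEntry SqrtCosts[] = {
    {Opcode::FSqrt, SimpleVT::f32, 4},   {Opcode::FSqrt, SimpleVT::v2f32, 4},
    {Opcode::FSqrt, SimpleVT::v4f32, 4}, {Opcode::FSqrt, SimpleVT::f64, 5},
    {Opcode::FSqrt, SimpleVT::v2f64, 5},
};

// Linear search over a cost table; returns nullptr when no row matches.
template <std::size_t N>
const CostEntry *lookup(const CostEntry (&Table)[N], Opcode Op, SimpleVT Type) {
  for (const CostEntry &E : Table)
    if (E.Op == Op && E.Type == Type)
      return &E;
  return nullptr;
}

// NumParts plays the role of LT.first (how many legal-width operations the
// original type is split into); Fallback stands in for the base-class cost.
int intrinsicCost(Opcode Op, SimpleVT LegalType, int NumParts, int Fallback) {
  if (const CostEntry *E = lookup(SqrtCosts, Op, LegalType))
    return NumParts * E->Cost;
  return Fallback;
}
```

For example, a type that legalizes to two v2f64 operations would be costed as 2 * 5 here, while an intrinsic with no table row falls through to the fallback value.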

test/Analysis/CostModel/AArch64/arith-fp.ll

This file was added.

				; RUN: opt < %s -enable-no-nans-fp-math -cost-model -analyze | FileCheck %s

				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64--linux-gnu"

				define i32 @fsqrt(i32 %arg) {
				%F32 = call float @llvm.sqrt.f32(float undef)
				%V2F32 = call <2 x float> @llvm.sqrt.v2f32(<2 x float> undef)
				%V4F32 = call <4 x float> @llvm.sqrt.v4f32(<4 x float> undef)
				; CHECK: cost of 4 {{.*}} %F32 = call float @llvm.sqrt.f32
				; CHECK: cost of 4 {{.*}} %V2F32 = call <2 x float> @llvm.sqrt.v2f32
				; CHECK: cost of 4 {{.*}} %V4F32 = call <4 x float> @llvm.sqrt.v4f32

				%F64 = call double @llvm.sqrt.f64(double undef)
				%V2F64 = call <2 x double> @llvm.sqrt.v2f64(<2 x double> undef)
				; CHECK: cost of 5 {{.*}} %F64 = call double @llvm.sqrt.f64
				; CHECK: cost of 5 {{.*}} %V2F64 = call <2 x double> @llvm.sqrt.v2f64

				ret i32 undef
				}

				declare float @llvm.sqrt.f32(float)
				declare <2 x float> @llvm.sqrt.v2f32(<2 x float>)
				declare <4 x float> @llvm.sqrt.v4f32(<4 x float>)

				declare double @llvm.sqrt.f64(double)
				declare <2 x double> @llvm.sqrt.v2f64(<2 x double>)

test/Transforms/SLPVectorizer/AArch64/intrinsic-cost-model.ll

This file was added.

				; RUN: opt < %s -basicaa -slp-vectorizer -S -mtriple=aarch64-unknown-linux-gnu -mcpu=kryo | FileCheck %s

				target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64--linux-gnu"

				; CHECK-LABEL: @test1
				; CHECK: fmul fast <2 x double>
				; CHECK: fdiv fast <2 x double>
				; CHECK: call fast <2 x double> @llvm.sqrt.v2f64(<2 x double>
				; CHECK: call fast <2 x double> @llvm.sqrt.v2f64(<2 x double>
				; CHECK: ret double
				define double @test1(double %t1, double %t2, double %t3, double %z1, double %z2) {
				entry:
				%cmp = fcmp fast une double %t1, 0.000000e+00
				%cmp1 = fcmp fast une double %t2, 0.000000e+00
				%or.cond = and i1 %cmp, %cmp1
				%cmp3 = fcmp fast une double %t3, 0.000000e+00
				%or.cond20 = and i1 %or.cond, %cmp3
				br i1 %or.cond20, label %if.then, label %return

				if.then: ; preds = %entry
				%mul = fmul fast double %t1, 2.000000e+00
				%mul4 = fmul fast double %mul, %z1
				%div = fdiv fast double %mul4, %t2
				%0 = tail call fast double @llvm.sqrt.f64(double %div)
				%div7 = fdiv fast double %mul4, %t3
				%1 = tail call fast double @llvm.sqrt.f64(double %div7)
				%mul9 = fmul fast double %mul, %z2
				%div10 = fdiv fast double %mul9, %t2
				%2 = tail call fast double @llvm.sqrt.f64(double %div10)
				%div13 = fdiv fast double %mul9, %t3
				%3 = tail call fast double @llvm.sqrt.f64(double %div13)
				%cmp14 = fcmp fast ogt double %0, %2
				%cond = select i1 %cmp14, double %0, double %2
				%cmp15 = fcmp fast ogt double %1, %3
				%cond19 = select i1 %cmp15, double %1, double %3
				%add = fadd fast double %cond, %cond19
				br label %return

				return: ; preds = %entry, %if.then
				%retval.0 = phi double [ %add, %if.then ], [ 0.000000e+00, %entry ]
				ret double %retval.0
				}

				declare double @llvm.sqrt.f64(double)
