This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Improve code size cost model
Closed, Public

Authored by dfukalov on Oct 11 2019, 11:48 AM.


Details

Reviewers

rampitec
arsenm

Commits

rL375109: [AMDGPU] Improve code size cost model
rG39720575117e: [AMDGPU] Improve code size cost model

Summary

Added estimation for zero size insertelement, extractelement
and llvm.fabs operators.
Updated inline/unroll parameters default values.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

dfukalov created this revision. Oct 11 2019, 11:48 AM

Herald added a project: Restricted Project. Oct 11 2019, 11:48 AM

Herald added subscribers: hiraditya, t-tye, tpr and 6 others.

Harbormaster completed remote builds in B39435: Diff 224646. Oct 11 2019, 11:49 AM

arsenm added inline comments. Oct 11 2019, 12:04 PM

llvm/lib/Target/AMDGPU/AMDGPUInline.cpp
54	This is a separate change
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
698	We already report vector insert/extract as free. Why does this need to look at these specifically? What is the purpose of Operands, which seems to be ignored? What uses this version? I thought the set of cost model functions with specific value contexts was only used by the vectorizers.
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
207	This is a separate change

dfukalov marked 3 inline comments as done. Oct 14 2019, 7:46 AM

dfukalov added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUInline.cpp
54	The parameter's default value should be updated to match the changed cost model, to avoid performance regressions.
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
698	CostModel has three estimation modes: RecipThroughput, Latency and CodeSize. The vectorizer uses the first one, but the inliner and unroller use code size estimates. Insert/extract and other estimates were implemented for the RecipThroughput path only, so the inliner, for example, got wrong code size cost estimates for such instructions. The change introduces the same estimates for some trivial cases by overloading getUserCost().
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h
207	The parameter's default value should be updated to match the changed cost model, to avoid performance regressions.
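
The mismatch described in the reply above can be pictured with a small standalone model. This is only a sketch, not LLVM code: `CostKind`, `Opcode`, `throughputCost`, `codeSizeCostBefore`, `codeSizeCostAfter` and `estimate` are hypothetical names, and the cost values are illustrative. It assumes that, before the change, the code-size path had no special case for element accesses.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; none of these names are LLVM's.
enum class CostKind { RecipThroughput, Latency, CodeSize };
enum class Opcode { Add, ExtractElement, InsertElement };

// Throughput path: the target already reports element accesses as free.
unsigned throughputCost(Opcode Op) {
  switch (Op) {
  case Opcode::ExtractElement:
  case Opcode::InsertElement:
    return 0;
  default:
    return 1;
  }
}

// Code-size path before the change (assumed): no special cases at all.
unsigned codeSizeCostBefore(Opcode) { return 1; }

// Code-size path after the change: reuse the target's answer for the
// trivial cases and fall back to the generic estimate otherwise.
unsigned codeSizeCostAfter(Opcode Op) {
  switch (Op) {
  case Opcode::ExtractElement:
  case Opcode::InsertElement:
    return throughputCost(Op);
  default:
    return codeSizeCostBefore(Op);
  }
}

// Dispatch on the estimation mode, as a client such as an inliner might.
unsigned estimate(Opcode Op, CostKind Kind, bool WithPatch) {
  assert(Kind != CostKind::Latency && "latency is not modelled in this sketch");
  if (Kind == CostKind::RecipThroughput)
    return throughputCost(Op);
  return WithPatch ? codeSizeCostAfter(Op) : codeSizeCostBefore(Op);
}
```

In this model an `ExtractElement` costs 0 for throughput but 1 for code size until the patched path is used, which is the kind of disagreement the reply attributes to the inliner's estimates.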

arsenm added inline comments. Oct 15 2019, 1:02 PM

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
698	Nothing here looks target specific though? It's just forwarding the calls. Why doesn't the base implementation do this?

dfukalov marked an inline comment as done. Oct 16 2019, 5:47 AM

dfukalov added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
698	The parameter preparation and call forwarding scheme is taken from the base implementation of getInstructionThroughput(), but we cannot say that, for all targets, zero cost in terms of throughput means zero code size. Moreover, getVectorInstrCost() is target specific here, and I'm going to add estimation with an overloaded getShuffleCost() too.
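
The argument for keeping the forwarding in the target rather than in the base class can be sketched with a miniature CRTP hierarchy. This is a hedged illustration only: `BaseCostModel`, `GpuLikeCostModel`, `OtherCostModel` and their members are hypothetical names that merely echo the shape of the discussion, and the costs are made up.

```cpp
#include <cassert>

// Hypothetical miniature of a CRTP cost-model layering; not LLVM's classes.
enum class Op { Add, ExtractElement };

template <typename Derived> struct BaseCostModel {
  // Generic throughput estimate for a vector element access.
  unsigned vectorInstrCost(Op) { return 1; }

  // Throughput query: forwards to whatever the derived target provides.
  unsigned instructionThroughput(Op O) {
    return static_cast<Derived *>(this)->vectorInstrCost(O);
  }

  // Generic code-size estimate. The base does not assume that "free in
  // throughput" also means "emits no code".
  unsigned userCost(Op) { return 1; }
};

struct GpuLikeCostModel : BaseCostModel<GpuLikeCostModel> {
  using BaseT = BaseCostModel<GpuLikeCostModel>;

  // Target knowledge: this target reports element extraction as free.
  unsigned vectorInstrCost(Op O) {
    return O == Op::ExtractElement ? 0 : BaseT::vectorInstrCost(O);
  }

  // The target opts in to reusing that answer for code size.
  unsigned userCost(Op O) {
    if (O == Op::ExtractElement)
      return vectorInstrCost(O);
    return BaseT::userCost(O);
  }
};

// A target that makes no such claim keeps the generic behaviour.
struct OtherCostModel : BaseCostModel<OtherCostModel> {};

// Small self-check showing the intended behaviour of the sketch.
void checkLayering() {
  assert(GpuLikeCostModel{}.userCost(Op::ExtractElement) == 0);
  assert(OtherCostModel{}.userCost(Op::ExtractElement) == 1);
}
```

Only the target that knows its element accesses are free maps the throughput answer onto code size; `OtherCostModel` is unaffected, which is the distinction the reply draws.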

LGTM but I still find TTI's set of cost functions incomprehensible

This revision is now accepted and ready to land. Oct 16 2019, 12:51 PM

Closed by commit rG39720575117e: [AMDGPU] Improve code size cost model (authored by dfukalov). Oct 17 2019, 5:16 AM

This revision was automatically updated to reflect the committed changes.

rampitec mentioned this in D68873: [AMDGPU] Amend target loop unroll defaults. Oct 17 2019, 9:09 AM

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPUInline.cpp	2 lines
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h	3 lines
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp	35 lines
llvm/test/Analysis/CostModel/AMDGPU/extractelement.ll	11 lines
llvm/test/Analysis/CostModel/AMDGPU/fabs.ll	21 lines
llvm/test/Analysis/CostModel/AMDGPU/insertelement.ll	10 lines

Diff 225408

llvm/lib/Target/AMDGPU/AMDGPUInline.cpp

	// it into registers we gain nothing by aggressively inlining functions for that			// it into registers we gain nothing by aggressively inlining functions for that
	// heuristic.			// heuristic.
	static cl::opt<unsigned>			static cl::opt<unsigned>
	ArgAllocaCutoff("amdgpu-inline-arg-alloca-cutoff", cl::Hidden, cl::init(256),			ArgAllocaCutoff("amdgpu-inline-arg-alloca-cutoff", cl::Hidden, cl::init(256),
	cl::desc("Maximum alloca size to use for inline cost"));			cl::desc("Maximum alloca size to use for inline cost"));

	// Inliner constraint to achieve reasonable compilation time			// Inliner constraint to achieve reasonable compilation time
	static cl::opt<size_t>			static cl::opt<size_t>
	MaxBB("amdgpu-inline-max-bb", cl::Hidden, cl::init(300),			MaxBB("amdgpu-inline-max-bb", cl::Hidden, cl::init(1100),
	cl::desc("Maximum BB number allowed in a function after inlining"			cl::desc("Maximum BB number allowed in a function after inlining"
	" (compile time constraint)"));			" (compile time constraint)"));

	namespace {			namespace {

	class AMDGPUInliner : public LegacyInlinerBase {			class AMDGPUInliner : public LegacyInlinerBase {

	public:			public:

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.h

public:		public:
unsigned getVectorSplitCost() { return 0; }		unsigned getVectorSplitCost() { return 0; }

unsigned getShuffleCost(TTI::ShuffleKind Kind, Type *Tp, int Index,		unsigned getShuffleCost(TTI::ShuffleKind Kind, Type *Tp, int Index,
Type *SubTp);		Type *SubTp);

bool areInlineCompatible(const Function *Caller,		bool areInlineCompatible(const Function *Caller,
const Function *Callee) const;		const Function *Callee) const;

unsigned getInliningThresholdMultiplier() { return 7; }		unsigned getInliningThresholdMultiplier() { return 9; }

int getInlinerVectorBonusPercent() { return 0; }		int getInlinerVectorBonusPercent() { return 0; }

int getArithmeticReductionCost(unsigned Opcode,		int getArithmeticReductionCost(unsigned Opcode,
Type *Ty,		Type *Ty,
bool IsPairwise);		bool IsPairwise);
int getMinMaxReductionCost(Type *Ty, Type *CondTy,		int getMinMaxReductionCost(Type *Ty, Type *CondTy,
bool IsPairwiseForm,		bool IsPairwiseForm,
bool IsUnsigned);		bool IsUnsigned);
		unsigned getUserCost(const User *U, ArrayRef<const Value *> Operands);
};		};

class R600TTIImpl final : public BasicTTIImplBase<R600TTIImpl> {		class R600TTIImpl final : public BasicTTIImplBase<R600TTIImpl> {
using BaseT = BasicTTIImplBase<R600TTIImpl>;		using BaseT = BasicTTIImplBase<R600TTIImpl>;
using TTI = TargetTransformInfo;		using TTI = TargetTransformInfo;

friend BaseT;		friend BaseT;
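
The header hunk above raises `getInliningThresholdMultiplier()` from 7 to 9. As a rough illustration of what that does to an inline budget, the sketch below assumes that the inliner multiplies a base threshold by the target multiplier and that the base threshold is 225. Both points are assumptions, since neither appears in this diff, and `scaledThreshold` and `AssumedBaseInlineThreshold` are hypothetical names.

```cpp
#include <cassert>

// Assumed base inline threshold; illustrative only, not taken from this diff.
constexpr int AssumedBaseInlineThreshold = 225;

// Hypothetical helper: scale a base threshold by a target multiplier.
int scaledThreshold(int BaseThreshold, unsigned Multiplier) {
  assert(BaseThreshold >= 0 && "thresholds are non-negative in this sketch");
  return BaseThreshold * static_cast<int>(Multiplier);
}
```

Under those assumptions, `scaledThreshold(AssumedBaseInlineThreshold, 7)` gives 1575 and `scaledThreshold(AssumedBaseInlineThreshold, 9)` gives 2025.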


llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp


using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "AMDGPUtti"		#define DEBUG_TYPE "AMDGPUtti"

static cl::opt<unsigned> UnrollThresholdPrivate(		static cl::opt<unsigned> UnrollThresholdPrivate(
"amdgpu-unroll-threshold-private",		"amdgpu-unroll-threshold-private",
cl::desc("Unroll threshold for AMDGPU if private memory used in a loop"),		cl::desc("Unroll threshold for AMDGPU if private memory used in a loop"),
cl::init(2500), cl::Hidden);		cl::init(2000), cl::Hidden);

static cl::opt<unsigned> UnrollThresholdLocal(		static cl::opt<unsigned> UnrollThresholdLocal(
"amdgpu-unroll-threshold-local",		"amdgpu-unroll-threshold-local",
cl::desc("Unroll threshold for AMDGPU if local memory used in a loop"),		cl::desc("Unroll threshold for AMDGPU if local memory used in a loop"),
cl::init(1000), cl::Hidden);		cl::init(1000), cl::Hidden);

static cl::opt<unsigned> UnrollThresholdIf(		static cl::opt<unsigned> UnrollThresholdIf(
"amdgpu-unroll-threshold-if",		"amdgpu-unroll-threshold-if",
bool GCNTTIImpl::areInlineCompatible(const Function *Caller,		bool GCNTTIImpl::areInlineCompatible(const Function *Caller,
[...]		[...]
return CallerMode.isInlineCompatible(CalleeMode);		return CallerMode.isInlineCompatible(CalleeMode);
}		}

void GCNTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,		void GCNTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
TTI::UnrollingPreferences &UP) {		TTI::UnrollingPreferences &UP) {
CommonTTI.getUnrollingPreferences(L, SE, UP);		CommonTTI.getUnrollingPreferences(L, SE, UP);
}		}

		unsigned GCNTTIImpl::getUserCost(const User *U,
		ArrayRef<const Value *> Operands) {
		// Estimate extractelement elimination
		if (const ExtractElementInst *EE = dyn_cast<ExtractElementInst>(U)) {
		ConstantInt *CI = dyn_cast<ConstantInt>(EE->getOperand(1));
		unsigned Idx = -1;
		if (CI)
		Idx = CI->getZExtValue();
		return getVectorInstrCost(EE->getOpcode(), EE->getOperand(0)->getType(),
		Idx);
		}

		// Estimate insertelement elimination
		if (const InsertElementInst *IE = dyn_cast<InsertElementInst>(U)) {
		ConstantInt *CI = dyn_cast<ConstantInt>(IE->getOperand(2));
		unsigned Idx = -1;
		if (CI)
		Idx = CI->getZExtValue();
		return getVectorInstrCost(IE->getOpcode(), IE->getType(), Idx);
		}

		// Estimate different intrinsics, e.g. llvm.fabs
		if (const IntrinsicInst *II = dyn_cast<IntrinsicInst>(U)) {
		SmallVector<Value *, 4> Args(II->arg_operands());
		FastMathFlags FMF;
		if (auto *FPMO = dyn_cast<FPMathOperator>(II))
		FMF = FPMO->getFastMathFlags();
		return getIntrinsicInstrCost(II->getIntrinsicID(), II->getType(), Args,
		FMF);
		}
		return BaseT::getUserCost(U, Operands);
		}

unsigned R600TTIImpl::getHardwareNumberOfRegisters(bool Vec) const {		unsigned R600TTIImpl::getHardwareNumberOfRegisters(bool Vec) const {
return 4 * 128; // XXX - 4 channels. Should these count as vector instead?		return 4 * 128; // XXX - 4 channels. Should these count as vector instead?
}		}

unsigned R600TTIImpl::getNumberOfRegisters(bool Vec) const {		unsigned R600TTIImpl::getNumberOfRegisters(bool Vec) const {
return getHardwareNumberOfRegisters(Vec);		return getHardwareNumberOfRegisters(Vec);
}		}
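
The `getUserCost()` hunk above passes the element index through when it is a constant and otherwise leaves `Idx` at `-1`, which wraps to the largest `unsigned` value. The sketch below isolates that pattern and pairs it with a toy cost rule. All names (`UnknownIndex`, `indexOrUnknown`, `elementAccessCost`) are hypothetical. The rule is chosen only to agree with the checks visible in this revision's tests (0 for `<2 x i32>` and `<2 x i64>` with a constant index; for `<2 x i16>` at index 0, 1 on the default CPU and 0 on fiji/gfx900). The dynamic-index cost of 2, the fallback cost of 1, and the reading of the CPU difference as "has 16-bit instructions" are assumptions, not taken from the diff.

```cpp
#include <cassert>

// Sentinel matching `unsigned Idx = -1;` in the hunk above.
constexpr unsigned UnknownIndex = ~0u;

// Hypothetical helper: a null pointer stands in for "index is not a constant",
// mirroring the `if (CI) Idx = CI->getZExtValue();` pattern.
unsigned indexOrUnknown(const unsigned *ConstantIndex) {
  return ConstantIndex ? *ConstantIndex : UnknownIndex;
}

// Toy cost rule for insertelement/extractelement; see the assumptions in
// the text above. It is not the target's real implementation.
unsigned elementAccessCost(unsigned EltBits, unsigned Idx, bool Has16BitInsts) {
  assert(EltBits > 0 && "element width must be positive");
  if (EltBits >= 32)
    return Idx == UnknownIndex ? 2 : 0; // dynamic-index cost is an assumption
  if (EltBits == 16 && Idx == 0 && Has16BitInsts)
    return 0;
  return 1; // placeholder for the generic fallback
}
```

For example, `elementAccessCost(16, 0, false)` and `elementAccessCost(16, 0, true)` correspond to the CI and GFX89 expectations for `extractelement_0_v2i16` in the test file of this revision.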


llvm/test/Analysis/CostModel/AMDGPU/extractelement.ll

; RUN: opt -cost-model -analyze -mtriple=amdgcn-unknown-amdhsa %s | FileCheck -check-prefixes=GCN,CI %s		; RUN: opt -cost-model -analyze -mtriple=amdgcn-unknown-amdhsa %s | FileCheck -check-prefixes=GCN,CI %s
; RUN: opt -cost-model -analyze -mtriple=amdgcn-unknown-amdhsa -mcpu=fiji %s | FileCheck -check-prefixes=GCN,VI %s		; RUN: opt -cost-model -analyze -mtriple=amdgcn-unknown-amdhsa -mcpu=fiji %s | FileCheck -check-prefixes=GCN,GFX89 %s
; RUN: opt -cost-model -analyze -mtriple=amdgcn-unknown-amdhsa -mcpu=gfx900 %s | FileCheck -check-prefixes=GCN,GFX9 %s		; RUN: opt -cost-model -analyze -mtriple=amdgcn-unknown-amdhsa -mcpu=gfx900 %s | FileCheck -check-prefixes=GCN,GFX89 %s
		; RUN: opt -cost-model -cost-kind=code-size -analyze -mtriple=amdgcn-unknown-amdhsa %s | FileCheck -check-prefixes=GCN,CI %s
		; RUN: opt -cost-model -cost-kind=code-size -analyze -mtriple=amdgcn-unknown-amdhsa -mcpu=fiji %s | FileCheck -check-prefixes=GCN,GFX89 %s
		; RUN: opt -cost-model -cost-kind=code-size -analyze -mtriple=amdgcn-unknown-amdhsa -mcpu=gfx900 %s | FileCheck -check-prefixes=GCN,GFX89 %s


; GCN: 'extractelement_v2i32'		; GCN: 'extractelement_v2i32'
; GCN: estimated cost of 0 for {{.*}} extractelement <2 x i32>		; GCN: estimated cost of 0 for {{.*}} extractelement <2 x i32>
define amdgpu_kernel void @extractelement_v2i32(i32 addrspace(1)* %out, <2 x i32> addrspace(1)* %vaddr) {		define amdgpu_kernel void @extractelement_v2i32(i32 addrspace(1)* %out, <2 x i32> addrspace(1)* %vaddr) {
%vec = load <2 x i32>, <2 x i32> addrspace(1)* %vaddr		%vec = load <2 x i32>, <2 x i32> addrspace(1)* %vaddr
%elt = extractelement <2 x i32> %vec, i32 1		%elt = extractelement <2 x i32> %vec, i32 1
store i32 %elt, i32 addrspace(1)* %out		store i32 %elt, i32 addrspace(1)* %out
ret void		ret void
define amdgpu_kernel void @extractelement_v4i8(i8 addrspace(1)* %out, <4 x i8> addrspace(1)* %vaddr) {		define amdgpu_kernel void @extractelement_v4i8(i8 addrspace(1)* %out, <4 x i8> addrspace(1)* %vaddr) {
%vec = load <4 x i8>, <4 x i8> addrspace(1)* %vaddr		%vec = load <4 x i8>, <4 x i8> addrspace(1)* %vaddr
%elt = extractelement <4 x i8> %vec, i8 1		%elt = extractelement <4 x i8> %vec, i8 1
store i8 %elt, i8 addrspace(1)* %out		store i8 %elt, i8 addrspace(1)* %out
ret void		ret void
}		}

; GCN: 'extractelement_0_v2i16':		; GCN: 'extractelement_0_v2i16':
; CI: estimated cost of 1 for {{.*}} extractelement <2 x i16> %vec, i16 0		; CI: estimated cost of 1 for {{.*}} extractelement <2 x i16> %vec, i16 0
; VI: estimated cost of 0 for {{.*}} extractelement <2 x i16>		; GFX89: estimated cost of 0 for {{.*}} extractelement <2 x i16>
; GFX9: estimated cost of 0 for {{.*}} extractelement <2 x i16>
define amdgpu_kernel void @extractelement_0_v2i16(i16 addrspace(1)* %out, <2 x i16> addrspace(1)* %vaddr) {		define amdgpu_kernel void @extractelement_0_v2i16(i16 addrspace(1)* %out, <2 x i16> addrspace(1)* %vaddr) {
%vec = load <2 x i16>, <2 x i16> addrspace(1)* %vaddr		%vec = load <2 x i16>, <2 x i16> addrspace(1)* %vaddr
%elt = extractelement <2 x i16> %vec, i16 0		%elt = extractelement <2 x i16> %vec, i16 0
store i16 %elt, i16 addrspace(1)* %out		store i16 %elt, i16 addrspace(1)* %out
ret void		ret void
}		}

; GCN: 'extractelement_1_v2i16':		; GCN: 'extractelement_1_v2i16':

llvm/test/Analysis/CostModel/AMDGPU/fabs.ll

	; RUN: opt -cost-model -analyze -mtriple=amdgcn-unknown-amdhsa < %s | FileCheck %s			; RUN: opt -cost-model -analyze -mtriple=amdgcn-unknown-amdhsa < %s | FileCheck %s
				; RUN: opt -cost-model -cost-kind=code-size -analyze -mtriple=amdgcn-unknown-amdhsa < %s | FileCheck %s

	; CHECK: 'fabs_f32'			; CHECK-LABEL: 'fabs_f32'
	; CHECK: estimated cost of 0 for {{.*}} call float @llvm.fabs.f32			; CHECK: estimated cost of 0 for {{.*}} call float @llvm.fabs.f32
	define amdgpu_kernel void @fabs_f32(float addrspace(1)* %out, float addrspace(1)* %vaddr) #0 {			define amdgpu_kernel void @fabs_f32(float addrspace(1)* %out, float addrspace(1)* %vaddr) #0 {
	%vec = load float, float addrspace(1)* %vaddr			%vec = load float, float addrspace(1)* %vaddr
	%fabs = call float @llvm.fabs.f32(float %vec) #1			%fabs = call float @llvm.fabs.f32(float %vec) #1
	store float %fabs, float addrspace(1)* %out			store float %fabs, float addrspace(1)* %out
	ret void			ret void
	}			}

	; CHECK: 'fabs_v2f32'			; CHECK-LABEL: 'fabs_v2f32'
	; CHECK: estimated cost of 0 for {{.*}} call <2 x float> @llvm.fabs.v2f32			; CHECK: estimated cost of 0 for {{.*}} call <2 x float> @llvm.fabs.v2f32
	define amdgpu_kernel void @fabs_v2f32(<2 x float> addrspace(1)* %out, <2 x float> addrspace(1)* %vaddr) #0 {			define amdgpu_kernel void @fabs_v2f32(<2 x float> addrspace(1)* %out, <2 x float> addrspace(1)* %vaddr) #0 {
	%vec = load <2 x float>, <2 x float> addrspace(1)* %vaddr			%vec = load <2 x float>, <2 x float> addrspace(1)* %vaddr
	%fabs = call <2 x float> @llvm.fabs.v2f32(<2 x float> %vec) #1			%fabs = call <2 x float> @llvm.fabs.v2f32(<2 x float> %vec) #1
	store <2 x float> %fabs, <2 x float> addrspace(1)* %out			store <2 x float> %fabs, <2 x float> addrspace(1)* %out
	ret void			ret void
	}			}

	; CHECK: 'fabs_v3f32'			; CHECK-LABEL: 'fabs_v3f32'
	; CHECK: estimated cost of 0 for {{.*}} call <3 x float> @llvm.fabs.v3f32			; CHECK: estimated cost of 0 for {{.*}} call <3 x float> @llvm.fabs.v3f32
	define amdgpu_kernel void @fabs_v3f32(<3 x float> addrspace(1)* %out, <3 x float> addrspace(1)* %vaddr) #0 {			define amdgpu_kernel void @fabs_v3f32(<3 x float> addrspace(1)* %out, <3 x float> addrspace(1)* %vaddr) #0 {
	%vec = load <3 x float>, <3 x float> addrspace(1)* %vaddr			%vec = load <3 x float>, <3 x float> addrspace(1)* %vaddr
	%fabs = call <3 x float> @llvm.fabs.v3f32(<3 x float> %vec) #1			%fabs = call <3 x float> @llvm.fabs.v3f32(<3 x float> %vec) #1
	store <3 x float> %fabs, <3 x float> addrspace(1)* %out			store <3 x float> %fabs, <3 x float> addrspace(1)* %out
	ret void			ret void
	}			}

	; CHECK: 'fabs_v5f32'			; CHECK-LABEL: 'fabs_v5f32'
	; CHECK: estimated cost of 0 for {{.*}} call <5 x float> @llvm.fabs.v5f32			; CHECK: estimated cost of 0 for {{.*}} call <5 x float> @llvm.fabs.v5f32
	define amdgpu_kernel void @fabs_v5f32(<5 x float> addrspace(1)* %out, <5 x float> addrspace(1)* %vaddr) #0 {			define amdgpu_kernel void @fabs_v5f32(<5 x float> addrspace(1)* %out, <5 x float> addrspace(1)* %vaddr) #0 {
	%vec = load <5 x float>, <5 x float> addrspace(1)* %vaddr			%vec = load <5 x float>, <5 x float> addrspace(1)* %vaddr
	%fabs = call <5 x float> @llvm.fabs.v5f32(<5 x float> %vec) #1			%fabs = call <5 x float> @llvm.fabs.v5f32(<5 x float> %vec) #1
	store <5 x float> %fabs, <5 x float> addrspace(1)* %out			store <5 x float> %fabs, <5 x float> addrspace(1)* %out
	ret void			ret void
	}			}

	; CHECK: 'fabs_f64'			; CHECK-LABEL: 'fabs_f64'
	; CHECK: estimated cost of 0 for {{.*}} call double @llvm.fabs.f64			; CHECK: estimated cost of 0 for {{.*}} call double @llvm.fabs.f64
	define amdgpu_kernel void @fabs_f64(double addrspace(1)* %out, double addrspace(1)* %vaddr) #0 {			define amdgpu_kernel void @fabs_f64(double addrspace(1)* %out, double addrspace(1)* %vaddr) #0 {
	%vec = load double, double addrspace(1)* %vaddr			%vec = load double, double addrspace(1)* %vaddr
	%fabs = call double @llvm.fabs.f64(double %vec) #1			%fabs = call double @llvm.fabs.f64(double %vec) #1
	store double %fabs, double addrspace(1)* %out			store double %fabs, double addrspace(1)* %out
	ret void			ret void
	}			}

	; CHECK: 'fabs_v2f64'			; CHECK-LABEL: 'fabs_v2f64'
	; CHECK: estimated cost of 0 for {{.*}} call <2 x double> @llvm.fabs.v2f64			; CHECK: estimated cost of 0 for {{.*}} call <2 x double> @llvm.fabs.v2f64
	define amdgpu_kernel void @fabs_v2f64(<2 x double> addrspace(1)* %out, <2 x double> addrspace(1)* %vaddr) #0 {			define amdgpu_kernel void @fabs_v2f64(<2 x double> addrspace(1)* %out, <2 x double> addrspace(1)* %vaddr) #0 {
	%vec = load <2 x double>, <2 x double> addrspace(1)* %vaddr			%vec = load <2 x double>, <2 x double> addrspace(1)* %vaddr
	%fabs = call <2 x double> @llvm.fabs.v2f64(<2 x double> %vec) #1			%fabs = call <2 x double> @llvm.fabs.v2f64(<2 x double> %vec) #1
	store <2 x double> %fabs, <2 x double> addrspace(1)* %out			store <2 x double> %fabs, <2 x double> addrspace(1)* %out
	ret void			ret void
	}			}

	; CHECK: 'fabs_v3f64'			; CHECK-LABEL: 'fabs_v3f64'
	; CHECK: estimated cost of 0 for {{.*}} call <3 x double> @llvm.fabs.v3f64			; CHECK: estimated cost of 0 for {{.*}} call <3 x double> @llvm.fabs.v3f64
	define amdgpu_kernel void @fabs_v3f64(<3 x double> addrspace(1)* %out, <3 x double> addrspace(1)* %vaddr) #0 {			define amdgpu_kernel void @fabs_v3f64(<3 x double> addrspace(1)* %out, <3 x double> addrspace(1)* %vaddr) #0 {
	%vec = load <3 x double>, <3 x double> addrspace(1)* %vaddr			%vec = load <3 x double>, <3 x double> addrspace(1)* %vaddr
	%fabs = call <3 x double> @llvm.fabs.v3f64(<3 x double> %vec) #1			%fabs = call <3 x double> @llvm.fabs.v3f64(<3 x double> %vec) #1
	store <3 x double> %fabs, <3 x double> addrspace(1)* %out			store <3 x double> %fabs, <3 x double> addrspace(1)* %out
	ret void			ret void
	}			}

	; CHECK: 'fabs_f16'			; CHECK-LABEL: 'fabs_f16'
	; CHECK: estimated cost of 0 for {{.*}} call half @llvm.fabs.f16			; CHECK: estimated cost of 0 for {{.*}} call half @llvm.fabs.f16
	define amdgpu_kernel void @fabs_f16(half addrspace(1)* %out, half addrspace(1)* %vaddr) #0 {			define amdgpu_kernel void @fabs_f16(half addrspace(1)* %out, half addrspace(1)* %vaddr) #0 {
	%vec = load half, half addrspace(1)* %vaddr			%vec = load half, half addrspace(1)* %vaddr
	%fabs = call half @llvm.fabs.f16(half %vec) #1			%fabs = call half @llvm.fabs.f16(half %vec) #1
	store half %fabs, half addrspace(1)* %out			store half %fabs, half addrspace(1)* %out
	ret void			ret void
	}			}

	; CHECK: 'fabs_v2f16'			; CHECK-LABEL: 'fabs_v2f16'
	; CHECK: estimated cost of 0 for {{.*}} call <2 x half> @llvm.fabs.v2f16			; CHECK: estimated cost of 0 for {{.*}} call <2 x half> @llvm.fabs.v2f16
	define amdgpu_kernel void @fabs_v2f16(<2 x half> addrspace(1)* %out, <2 x half> addrspace(1)* %vaddr) #0 {			define amdgpu_kernel void @fabs_v2f16(<2 x half> addrspace(1)* %out, <2 x half> addrspace(1)* %vaddr) #0 {
	%vec = load <2 x half>, <2 x half> addrspace(1)* %vaddr			%vec = load <2 x half>, <2 x half> addrspace(1)* %vaddr
	%fabs = call <2 x half> @llvm.fabs.v2f16(<2 x half> %vec) #1			%fabs = call <2 x half> @llvm.fabs.v2f16(<2 x half> %vec) #1
	store <2 x half> %fabs, <2 x half> addrspace(1)* %out			store <2 x half> %fabs, <2 x half> addrspace(1)* %out
	ret void			ret void
	}			}

	; CHECK: 'fabs_v3f16'			; CHECK-LABEL: 'fabs_v3f16'
	; CHECK: estimated cost of 0 for {{.*}} call <3 x half> @llvm.fabs.v3f16			; CHECK: estimated cost of 0 for {{.*}} call <3 x half> @llvm.fabs.v3f16
	define amdgpu_kernel void @fabs_v3f16(<3 x half> addrspace(1)* %out, <3 x half> addrspace(1)* %vaddr) #0 {			define amdgpu_kernel void @fabs_v3f16(<3 x half> addrspace(1)* %out, <3 x half> addrspace(1)* %vaddr) #0 {
	%vec = load <3 x half>, <3 x half> addrspace(1)* %vaddr			%vec = load <3 x half>, <3 x half> addrspace(1)* %vaddr
	%fabs = call <3 x half> @llvm.fabs.v3f16(<3 x half> %vec) #1			%fabs = call <3 x half> @llvm.fabs.v3f16(<3 x half> %vec) #1
	store <3 x half> %fabs, <3 x half> addrspace(1)* %out			store <3 x half> %fabs, <3 x half> addrspace(1)* %out
	ret void			ret void
	}			}


llvm/test/Analysis/CostModel/AMDGPU/insertelement.ll

	; RUN: opt -cost-model -analyze -mtriple=amdgcn-unknown-amdhsa %s | FileCheck -check-prefixes=GCN,CI %s			; RUN: opt -cost-model -analyze -mtriple=amdgcn-unknown-amdhsa %s | FileCheck -check-prefixes=GCN,CI %s
	; RUN: opt -cost-model -analyze -mtriple=amdgcn-unknown-amdhsa -mcpu=fiji %s | FileCheck -check-prefixes=GCN,VI %s			; RUN: opt -cost-model -analyze -mtriple=amdgcn-unknown-amdhsa -mcpu=fiji %s | FileCheck -check-prefixes=GCN,GFX89 %s
	; RUN: opt -cost-model -analyze -mtriple=amdgcn-unknown-amdhsa -mcpu=gfx900 %s | FileCheck -check-prefixes=GCN,GFX9 %s			; RUN: opt -cost-model -analyze -mtriple=amdgcn-unknown-amdhsa -mcpu=gfx900 %s | FileCheck -check-prefixes=GCN,GFX89 %s
				; RUN: opt -cost-model -cost-kind=code-size -analyze -mtriple=amdgcn-unknown-amdhsa %s | FileCheck -check-prefixes=GCN,CI %s
				; RUN: opt -cost-model -cost-kind=code-size -analyze -mtriple=amdgcn-unknown-amdhsa -mcpu=fiji %s | FileCheck -check-prefixes=GCN,GFX89 %s
				; RUN: opt -cost-model -cost-kind=code-size -analyze -mtriple=amdgcn-unknown-amdhsa -mcpu=gfx900 %s | FileCheck -check-prefixes=GCN,GFX89 %s

	; GCN-LABEL: 'insertelement_v2i32'			; GCN-LABEL: 'insertelement_v2i32'
	; GCN: estimated cost of 0 for {{.*}} insertelement <2 x i32>			; GCN: estimated cost of 0 for {{.*}} insertelement <2 x i32>
	define amdgpu_kernel void @insertelement_v2i32(<2 x i32> addrspace(1)* %out, <2 x i32> addrspace(1)* %vaddr) {			define amdgpu_kernel void @insertelement_v2i32(<2 x i32> addrspace(1)* %out, <2 x i32> addrspace(1)* %vaddr) {
	%vec = load <2 x i32>, <2 x i32> addrspace(1)* %vaddr			%vec = load <2 x i32>, <2 x i32> addrspace(1)* %vaddr
	%insert = insertelement <2 x i32> %vec, i32 123, i32 1			%insert = insertelement <2 x i32> %vec, i32 123, i32 1
	store <2 x i32> %insert, <2 x i32> addrspace(1)* %out			store <2 x i32> %insert, <2 x i32> addrspace(1)* %out
	ret void			ret void
	}			}

	; GCN-LABEL: 'insertelement_v2i64'			; GCN-LABEL: 'insertelement_v2i64'
	; GCN: estimated cost of 0 for {{.*}} insertelement <2 x i64>			; GCN: estimated cost of 0 for {{.*}} insertelement <2 x i64>
	define amdgpu_kernel void @insertelement_v2i64(<2 x i64> addrspace(1)* %out, <2 x i64> addrspace(1)* %vaddr) {			define amdgpu_kernel void @insertelement_v2i64(<2 x i64> addrspace(1)* %out, <2 x i64> addrspace(1)* %vaddr) {
	%vec = load <2 x i64>, <2 x i64> addrspace(1)* %vaddr			%vec = load <2 x i64>, <2 x i64> addrspace(1)* %vaddr
	%insert = insertelement <2 x i64> %vec, i64 123, i64 1			%insert = insertelement <2 x i64> %vec, i64 123, i64 1
	store <2 x i64> %insert, <2 x i64> addrspace(1)* %out			store <2 x i64> %insert, <2 x i64> addrspace(1)* %out
	ret void			ret void
	}			}

	; GCN-LABEL: 'insertelement_0_v2i16'			; GCN-LABEL: 'insertelement_0_v2i16'
	; CI: estimated cost of 1 for {{.*}} insertelement <2 x i16>			; CI: estimated cost of 1 for {{.*}} insertelement <2 x i16>
	; VI: estimated cost of 0 for {{.*}} insertelement <2 x i16>			; GFX89: estimated cost of 0 for {{.*}} insertelement <2 x i16>
	; GFX9: estimated cost of 0 for {{.*}} insertelement <2 x i16>
	define amdgpu_kernel void @insertelement_0_v2i16(<2 x i16> addrspace(1)* %out, <2 x i16> addrspace(1)* %vaddr) {			define amdgpu_kernel void @insertelement_0_v2i16(<2 x i16> addrspace(1)* %out, <2 x i16> addrspace(1)* %vaddr) {
	%vec = load <2 x i16>, <2 x i16> addrspace(1)* %vaddr			%vec = load <2 x i16>, <2 x i16> addrspace(1)* %vaddr
	%insert = insertelement <2 x i16> %vec, i16 123, i16 0			%insert = insertelement <2 x i16> %vec, i16 123, i16 0
	store <2 x i16> %insert, <2 x i16> addrspace(1)* %out			store <2 x i16> %insert, <2 x i16> addrspace(1)* %out
	ret void			ret void
	}			}

	; GCN-LABEL: 'insertelement_1_v2i16'			; GCN-LABEL: 'insertelement_1_v2i16'