This is an archive of the discontinued LLVM Phabricator instance.

Differential D141602

[TTI][AArch64] Cost model insertelement and indexed LD1 instructions
ClosedPublic

Authored by SjoerdMeijer on Jan 12 2023, 4:54 AM.

Details

Reviewers

dmgreen
david-arm
samtebbs
• zino
paulwalker-arm

Commits

rG079c488c6605: [TTI][AArch64] Cost model insertelement and indexed LD1 instructions

Summary

The insertelement IR instruction can lead to different codegen; quite a few variants are available or applicable. One option is to generate an INS, the "ASIMD insert, element to element" instruction. This is actually a cheap instruction, as it only has a latency of 2 on modern cores like the N1, N2 and V1. Currently we model this with a cost of 3, which is perhaps slightly higher than needed, but that is for another time.

This patch is about another variant, an indexed LD1, or "ASIMD load, 1 element, one lane, B/H/S" instruction, which loads a value and inserts it as an element into a vector. This is actually an expensive instruction, with a latency of 8 on modern cores. We generate an indexed LD1 when an insertelement instruction has a load as an operand. This patch recognises that case and assigns a cost of 4 to this type of insertelement instruction, making it a bit more expensive than the 3 it was before. The new cost of 4 is fairly arbitrary; the point is that it makes the instruction more expensive.
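The rule described in this summary can be sketched as a small standalone model. This is an illustration only and not the LLVM code itself: `SubtargetModel`, `InsertQuery` and `insertElementCost` are hypothetical names, and the base cost of 3 is the value quoted in the summary. The order of the checks follows the patched helper shown in the diff further down.

```cpp
#include <cassert>

// Hypothetical stand-in for the AArch64 subtarget: only the base cost for
// vector insert/extract matters for this sketch.
struct SubtargetModel {
  unsigned VectorInsertExtractBaseCost;
};

// Hypothetical summary of one insertelement being costed.
struct InsertQuery {
  unsigned Index;        // lane written by the insertelement
  bool HasRealUse;       // true if the instruction exists in the input IR
  bool ElementIsInteger; // true if the vector element type is an integer
  bool ScalarIsLoad;     // true if the inserted scalar is the result of a load
};

// Same order of checks as the patched helper: free lane-0 cases first, then
// the load-fed case (expected to become an indexed LD1), then the default.
inline unsigned insertElementCost(const SubtargetModel &ST,
                                  const InsertQuery &Q) {
  if (Q.Index == 0 && (!Q.HasRealUse || !Q.ElementIsInteger))
    return 0;
  // Only a real instruction can be inspected for a load operand.
  if (Q.HasRealUse && Q.ScalarIsLoad)
    return ST.VectorInsertExtractBaseCost + 1;
  return ST.VectorInsertExtractBaseCost;
}

// Small self-check using the base cost of 3 mentioned in the summary.
inline void checkInsertElementCost() {
  const SubtargetModel Default{3};
  assert(insertElementCost(Default, {1, true, true, true}) == 4);  // load-fed
  assert(insertElementCost(Default, {1, true, true, false}) == 3); // plain insert
}
```

In this model a query without a real instruction never takes the load-fed path, which corresponds to the overload that has no instruction to inspect.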

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

SjoerdMeijer created this revision. Jan 12 2023, 4:54 AM

Herald added a project: Restricted Project. Jan 12 2023, 4:54 AM

Herald added subscribers: arphaman, hiraditya, kristof.beyls.

SjoerdMeijer requested review of this revision. Jan 12 2023, 4:54 AM

Harbormaster completed remote builds in B207369: Diff 488605. Jan 12 2023, 4:55 AM

Do you have any performance results?

You are wanting to increase the cost? I had actually tried to do the opposite a little while ago, as there are some routines that would benefit from doing single lane load/inserts, so that the rest of the routine could vectorize nicely. It is one instruction after all, but I couldn't justify it in general because of the high cost.
The cost these routines measure is reciprocal throughput by default, not latency. The operations seem to take two pipelines though, so a high cost would not be unreasonable.

Good questions! Let me provide a bit more context, which I indeed didn't provide.

I am looking at a few cases where the SLP vectoriser is too smart for its own good. I.e., the SLP vectoriser kicks in and generates vector code, but it tanks performance. It's exactly what you mentioned:

> single lane load/inserts, so that the rest of the routine could vectorize nicely.

It enables vectorisation, and the result looks like well-vectorised code, but it executes slower than scalar code. There are 2 problems I have seen so far. First, some instructions are expensive; this indexed LD1 is an example of that. Please note that this is not about the INS: as I mentioned in the description, I agree that its cost can probably be lowered. Second, the SLP vectorised code causes "dependency chains" resulting in a lot of backend stalls.

A recently raised SLP performance bug that is very similar to my problems is this one:

https://github.com/llvm/llvm-project/issues/59867

It's not exactly the same thing I was looking at, but very similar and a good proxy, I think. Perhaps I should raise a few reproducers as perf bugs upstream, but I have good hopes that solving that problem also solves my problems. So, summarising, I am on a little mission to rein the SLP vectoriser back in, and this is a bit of cost modelling that potentially helps. It does not solve PR59867; that seems to be something going wrong in the SLP vectoriser itself.

About performance results: a hand-written micro-benchmark clearly shows the benefit of not generating this particular LD1 variant. I ran some benchmarking that was neutral, but I am not sure this variant was generated or performance critical there. So I will see if I can do a bit more. From that point of view this just seemed to be the right thing to do, because it is an expensive instruction.

dtemirbulatov added a subscriber: dtemirbulatov. Jan 13 2023, 3:49 AM

I have in the past considered the cost from getVectorInstrCost to be less about the exact cost of moving into vector lanes, and more about how much silliness you are willing to put up with from the SLP vectorizer.

If you have test cases that could be added for the SLP vectorizer, that would be useful too.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
2222	It might be worth using `ST->getVectorInsertExtractBaseCost() + 1`

I am struggling with my motivating example for the SLP vectoriser. There are multiple things going on, and this doesn't seem to make a difference.
So for now, this "looks more correct" to me, as indexed LD1s are expensive. If you're not convinced, I am happy to park this for now.

Harbormaster completed remote builds in B208782: Diff 490552. Jan 19 2023, 10:07 AM

I'm happy for the change to go in, I think. It changes the total cost of load+insert from 1+3 to 1+4, which isn't a huge increase. It sounds OK to me, although it was already fairly high before and there is a chance we might want to adjust it in the future if we have good reason.

Can you change it to ST->getVectorInsertExtractBaseCost() + 1?

In D141602#4072537, @dmgreen wrote:

> I'm happy for the change to go in, I think. It changes the total cost of load+insert from 1+3 to 1+4, which isn't a huge increase. It sounds OK to me, although it was already fairly high before and there is a chance we might want to adjust it in the future if we have good reason.

Ok, cheers, sounds good.
I will follow up today/tomorrow, as I think we can lower the cost for a "MOV v1[1], v2[1]" type of element insertion; these are actually cheap instructions on modern cores. I think that would reflect things a lot better: it would make the indexed LD1 (cost 4) twice as expensive as a MOV/INS (cost 2, which is 3 at the moment). But I will run SPEC for this today to check that I am not missing anything.

> Can you change it to ST->getVectorInsertExtractBaseCost() + 1?

Ah, sorry, I thought I did, I must have uploaded a different/wrong diff.

This time with ST->getVectorInsertExtractBaseCost() + 1

Harbormaster completed remote builds in B209326: Diff 491298. Jan 23 2023, 4:07 AM

FYI: D142359, for the "ASIMD insert, element to element" instruction.

SjoerdMeijer mentioned this in D142359: [TTI][AArch64] Cost model vector INS instructions. Jan 23 2023, 7:04 AM

Matt added a subscriber: Matt. Jan 23 2023, 12:57 PM

LGTM, thanks.

This revision is now accepted and ready to land. Jan 24 2023, 7:43 AM

This revision was landed with ongoing or failed builds. Feb 9 2023, 8:28 AM

Closed by commit rG079c488c6605: [TTI][AArch64] Cost model insertelement and indexed LD1 instructions (authored by SjoerdMeijer).

This revision was automatically updated to reflect the committed changes.

SjoerdMeijer added a commit: rG079c488c6605: [TTI][AArch64] Cost model insertelement and indexed LD1 instructions.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h	4 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp	20 lines
llvm/test/Analysis/CostModel/AArch64/insert-extract.ll	16 lines

Diff 496145

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h

[... 59 lines not shown ...]
class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
bool isWideningInstruction(Type *Ty, unsigned Opcode,		bool isWideningInstruction(Type *Ty, unsigned Opcode,
ArrayRef<const Value *> Args);		ArrayRef<const Value *> Args);

// A helper function called by 'getVectorInstrCost'.		// A helper function called by 'getVectorInstrCost'.
//		//
// 'Val' and 'Index' are forwarded from 'getVectorInstrCost'; 'HasRealUse'		// 'Val' and 'Index' are forwarded from 'getVectorInstrCost'; 'HasRealUse'
// indicates whether the vector instruction is available in the input IR or		// indicates whether the vector instruction is available in the input IR or
// just imaginary in vectorizer passes.		// just imaginary in vectorizer passes.
InstructionCost getVectorInstrCostHelper(Type *Val, unsigned Index,		InstructionCost getVectorInstrCostHelper(const Instruction *I, Type *Val,
bool HasRealUse);		unsigned Index, bool HasRealUse);

public:		public:
explicit AArch64TTIImpl(const AArch64TargetMachine *TM, const Function &F)		explicit AArch64TTIImpl(const AArch64TargetMachine *TM, const Function &F)
: BaseT(TM, F.getParent()->getDataLayout()), ST(TM->getSubtargetImpl(F)),		: BaseT(TM, F.getParent()->getDataLayout()), ST(TM->getSubtargetImpl(F)),
TLI(ST->getTargetLowering()) {}		TLI(ST->getTargetLowering()) {}

bool areInlineCompatible(const Function *Caller,		bool areInlineCompatible(const Function *Caller,
const Function *Callee) const;		const Function *Callee) const;
[... 322 lines not shown ...]

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

[... 2,178 lines not shown ...]
InstructionCost AArch64TTIImpl::getCFInstrCost(unsigned Opcode,
const Instruction *I) {		const Instruction *I) {
if (CostKind != TTI::TCK_RecipThroughput)		if (CostKind != TTI::TCK_RecipThroughput)
return Opcode == Instruction::PHI ? 0 : 1;		return Opcode == Instruction::PHI ? 0 : 1;
assert(CostKind == TTI::TCK_RecipThroughput && "unexpected CostKind");		assert(CostKind == TTI::TCK_RecipThroughput && "unexpected CostKind");
// Branches are assumed to be predicted.		// Branches are assumed to be predicted.
return 0;		return 0;
}		}

InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(Type *Val,		InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(const Instruction *I,
		Type *Val,
unsigned Index,		unsigned Index,
bool HasRealUse) {		bool HasRealUse) {
assert(Val->isVectorTy() && "This must be a vector type");		assert(Val->isVectorTy() && "This must be a vector type");

if (Index != -1U) {		if (Index != -1U) {
// Legalize the type.		// Legalize the type.
std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Val);		std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Val);

[... 9 lines not shown ...]
}		}

// The element at index zero is already inside the vector.		// The element at index zero is already inside the vector.
// - For a physical (HasRealUse==true) insert-element or extract-element		// - For a physical (HasRealUse==true) insert-element or extract-element
// instruction that extracts integers, an explicit FPR -> GPR move is		// instruction that extracts integers, an explicit FPR -> GPR move is
// needed. So it has non-zero cost.		// needed. So it has non-zero cost.
// - For the rest of cases (virtual instruction or element type is float),		// - For the rest of cases (virtual instruction or element type is float),
// consider the instruction free.		// consider the instruction free.
//		if (Index == 0 && (!HasRealUse || !Val->getScalarType()->isIntegerTy()))
		return 0;

		// This is recognising a LD1 single-element structure to one lane of one
		// register instruction. I.e., if this is an `insertelement` instruction,
		// and its second operand is a load, then we will generate a LD1, which
		// are expensive instructions.
		if (I && dyn_cast<LoadInst>(I->getOperand(1)))
		return ST->getVectorInsertExtractBaseCost() + 1;

// FIXME:		// FIXME:
// If the extract-element and insert-element instructions could be		// If the extract-element and insert-element instructions could be
// simplified away (e.g., could be combined into users by looking at use-def		// simplified away (e.g., could be combined into users by looking at use-def
// context), they have no cost. This is not done in the first place for		// context), they have no cost. This is not done in the first place for
// compile-time considerations.		// compile-time considerations.
if (Index == 0 && (!HasRealUse || !Val->getScalarType()->isIntegerTy()))
return 0;
}		}

// All other insert/extracts cost this much.		// All other insert/extracts cost this much.
return ST->getVectorInsertExtractBaseCost();		return ST->getVectorInsertExtractBaseCost();
}		}

InstructionCost AArch64TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,		InstructionCost AArch64TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
TTI::TargetCostKind CostKind,		TTI::TargetCostKind CostKind,
unsigned Index, Value *Op0,		unsigned Index, Value *Op0,
Value *Op1) {		Value *Op1) {
return getVectorInstrCostHelper(Val, Index, false /* HasRealUse */);		return getVectorInstrCostHelper(nullptr, Val, Index, false /* HasRealUse */);
}		}

InstructionCost AArch64TTIImpl::getVectorInstrCost(const Instruction &I,		InstructionCost AArch64TTIImpl::getVectorInstrCost(const Instruction &I,
Type *Val,		Type *Val,
TTI::TargetCostKind CostKind,		TTI::TargetCostKind CostKind,
unsigned Index) {		unsigned Index) {
return getVectorInstrCostHelper(Val, Index, true /* HasRealUse */);		return getVectorInstrCostHelper(&I, Val, Index, true /* HasRealUse */);
}		}

InstructionCost AArch64TTIImpl::getArithmeticInstrCost(		InstructionCost AArch64TTIImpl::getArithmeticInstrCost(
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,		unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueInfo Op1Info, TTI::OperandValueInfo Op2Info,		TTI::OperandValueInfo Op1Info, TTI::OperandValueInfo Op2Info,
ArrayRef<const Value *> Args,		ArrayRef<const Value *> Args,
const Instruction *CxtI) {		const Instruction *CxtI) {

[... 1,159 lines not shown ...]

llvm/test/Analysis/CostModel/AArch64/insert-extract.ll

[... 102 lines not shown ...]
;
ret void		ret void
}		}

;; LD1: Load one single-element structure to one lane of one register.		;; LD1: Load one single-element structure to one lane of one register.

define <8 x i8> @LD1_B(<8 x i8> %vec, ptr noundef %i) {		define <8 x i8> @LD1_B(<8 x i8> %vec, ptr noundef %i) {
; KRYO-LABEL: 'LD1_B'		; KRYO-LABEL: 'LD1_B'
; KRYO-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v1 = load i8, ptr %i, align 1		; KRYO-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v1 = load i8, ptr %i, align 1
; KRYO-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v2 = insertelement <8 x i8> %vec, i8 %v1, i32 1		; KRYO-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2 = insertelement <8 x i8> %vec, i8 %v1, i32 1
; KRYO-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i8> %v2		; KRYO-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i8> %v2
;		;
; NEO-LABEL: 'LD1_B'		; NEO-LABEL: 'LD1_B'
; NEO-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v1 = load i8, ptr %i, align 1		; NEO-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v1 = load i8, ptr %i, align 1
; NEO-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2 = insertelement <8 x i8> %vec, i8 %v1, i32 1		; NEO-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %v2 = insertelement <8 x i8> %vec, i8 %v1, i32 1
; NEO-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i8> %v2		; NEO-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <8 x i8> %v2
;		;
entry:		entry:
%v1 = load i8, ptr %i, align 1		%v1 = load i8, ptr %i, align 1
%v2 = insertelement <8 x i8> %vec, i8 %v1, i32 1		%v2 = insertelement <8 x i8> %vec, i8 %v1, i32 1
ret <8x i8> %v2		ret <8x i8> %v2
}		}

define <4 x i16> @LD1_H(<4 x i16> %vec, ptr noundef %i) {		define <4 x i16> @LD1_H(<4 x i16> %vec, ptr noundef %i) {
; KRYO-LABEL: 'LD1_H'		; KRYO-LABEL: 'LD1_H'
; KRYO-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v1 = load i16, ptr %i, align 2		; KRYO-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v1 = load i16, ptr %i, align 2
; KRYO-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v2 = insertelement <4 x i16> %vec, i16 %v1, i32 2		; KRYO-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2 = insertelement <4 x i16> %vec, i16 %v1, i32 2
; KRYO-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i16> %v2		; KRYO-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i16> %v2
;		;
; NEO-LABEL: 'LD1_H'		; NEO-LABEL: 'LD1_H'
; NEO-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v1 = load i16, ptr %i, align 2		; NEO-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v1 = load i16, ptr %i, align 2
; NEO-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2 = insertelement <4 x i16> %vec, i16 %v1, i32 2		; NEO-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %v2 = insertelement <4 x i16> %vec, i16 %v1, i32 2
; NEO-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i16> %v2		; NEO-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i16> %v2
;		;
entry:		entry:
%v1 = load i16, ptr %i, align 2		%v1 = load i16, ptr %i, align 2
%v2 = insertelement <4 x i16> %vec, i16 %v1, i32 2		%v2 = insertelement <4 x i16> %vec, i16 %v1, i32 2
ret <4 x i16> %v2		ret <4 x i16> %v2
}		}

define <4 x i32> @LD1_W(<4 x i32> %vec, ptr noundef %i) {		define <4 x i32> @LD1_W(<4 x i32> %vec, ptr noundef %i) {
; KRYO-LABEL: 'LD1_W'		; KRYO-LABEL: 'LD1_W'
; KRYO-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v1 = load i32, ptr %i, align 4		; KRYO-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v1 = load i32, ptr %i, align 4
; KRYO-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v2 = insertelement <4 x i32> %vec, i32 %v1, i32 3		; KRYO-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2 = insertelement <4 x i32> %vec, i32 %v1, i32 3
; KRYO-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v2		; KRYO-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v2
;		;
; NEO-LABEL: 'LD1_W'		; NEO-LABEL: 'LD1_W'
; NEO-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v1 = load i32, ptr %i, align 4		; NEO-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v1 = load i32, ptr %i, align 4
; NEO-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2 = insertelement <4 x i32> %vec, i32 %v1, i32 3		; NEO-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %v2 = insertelement <4 x i32> %vec, i32 %v1, i32 3
; NEO-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v2		; NEO-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %v2
;		;
entry:		entry:
%v1 = load i32, ptr %i, align 4		%v1 = load i32, ptr %i, align 4
%v2 = insertelement <4 x i32> %vec, i32 %v1, i32 3		%v2 = insertelement <4 x i32> %vec, i32 %v1, i32 3
ret <4 x i32> %v2		ret <4 x i32> %v2
}		}

define <2 x i64> @LD1_X(<2 x i64> %vec, ptr noundef %i) {		define <2 x i64> @LD1_X(<2 x i64> %vec, ptr noundef %i) {
; KRYO-LABEL: 'LD1_X'		; KRYO-LABEL: 'LD1_X'
; KRYO-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v1 = load i64, ptr %i, align 8		; KRYO-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v1 = load i64, ptr %i, align 8
; KRYO-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %v2 = insertelement <2 x i64> %vec, i64 %v1, i32 0		; KRYO-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2 = insertelement <2 x i64> %vec, i64 %v1, i32 0
; KRYO-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v2		; KRYO-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v2
;		;
; NEO-LABEL: 'LD1_X'		; NEO-LABEL: 'LD1_X'
; NEO-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v1 = load i64, ptr %i, align 8		; NEO-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %v1 = load i64, ptr %i, align 8
; NEO-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %v2 = insertelement <2 x i64> %vec, i64 %v1, i32 0		; NEO-NEXT: Cost Model: Found an estimated cost of 4 for instruction: %v2 = insertelement <2 x i64> %vec, i64 %v1, i32 0
; NEO-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v2		; NEO-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret <2 x i64> %v2
;		;
entry:		entry:
%v1 = load i64, ptr %i, align 8		%v1 = load i64, ptr %i, align 8
%v2 = insertelement <2 x i64> %vec, i64 %v1, i32 0		%v2 = insertelement <2 x i64> %vec, i64 %v1, i32 0
ret <2 x i64> %v2		ret <2 x i64> %v2
}		}