This is an archive of the discontinued LLVM Phabricator instance.

Differential D55461

[PowerPC] Update Vector Costs for P9
Closed, Public

Authored by RolandF on Dec 7 2018, 4:02 PM.

Details

Reviewers

nemanjai
jsji
hfinkel

Commits

rG7d007ddedfd6: [PowerPC] Update Vector Costs for P9
rL352261: [PowerPC] Update Vector Costs for P9

Summary

For the power9 CPU, vector operations consume a pair of execution units rather than one execution unit like a scalar operation. Update the target transform cost functions to reflect the higher cost of vector operations when targeting power9.

Diff Detail

Repository: rL LLVM

Event Timeline

RolandF created this revision. Dec 7 2018, 4:02 PM

Herald added subscribers: kbarton, hiraditya. Dec 7 2018, 4:02 PM

I am confused by this patch.

In LoopVectorize.cpp, getInstructionCost returns the execution time cost of an instruction for a given vector.

So I think the cost model here should be related to latency.

Why do we need to take execution units into consideration for an execution-time cost?
Isn't considering execution units more of a throughput cost model?

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
337 ↗	(On Diff #177338)	A type for which `isVectorTy` is true here will not always be a legal vector type; it may be expanded or scalarized. In BaseT::getArithmeticInstrCost the computed cost may depend on the legalization, so I think we might have a problem here if we simply check isVectorTy to increase the cost.

jedilyn added a subscriber: jedilyn. Dec 14 2018, 12:47 AM

In D55461#1326192, @jsji wrote:

I am confused by this patch.

In LoopVectorize.cpp, getInstructionCost returns the execution time cost of an instruction for a given vector.

So I think the cost model here should be related to latency.

Why do we need to take execution units into consideration for an execution-time cost?
Isn't considering execution units more of a throughput cost model?

The calculation the loop vectorizer is making is comparing vector cost / vector factor to scalar cost. The calculation the SLP vectorizer is making is comparing vector cost to scalar cost * vector factor. In either case the vector factor is involved, and once the vector factor is involved it is really a throughput calculation and not a latency calculation. I don't think vectorization can make sense as a latency calculation - the scalar ops are always going to be at least as cheap. The win is in doing more at the same time, not in delivering a particular result faster.
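
To make the two comparisons concrete, here is a minimal C++ sketch. The function names and the integer costs are hypothetical and are not LLVM's API; the code only mirrors the arithmetic described in the reply above.

```cpp
// Hypothetical helpers, not LLVM API; they only mirror the two comparisons.

// Loop vectorizer view: cost per original iteration (vector cost / VF)
// versus the scalar cost of one iteration.
bool loopVectorizationProfitable(int ScalarCost, int VectorCost, int VF) {
  return static_cast<double>(VectorCost) / VF < ScalarCost;
}

// SLP vectorizer view: cost of one vector op versus VF scalar ops.
bool slpVectorizationProfitable(int ScalarCost, int VectorCost, int VF) {
  return VectorCost < ScalarCost * VF;
}
```

With a scalar cost of 1 and VF = 2, a vector cost of 1 passes both checks, while a doubled vector cost of 2 only breaks even and fails both. Either way the vector factor enters the comparison, which is why the reply treats it as a throughput calculation.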

jsji added a reviewer: hfinkel. Dec 18 2018, 3:22 PM

In D55461#1334814, @RolandF wrote:

The calculation the loop vectorizer is making is comparing vector cost / vector factor to scalar cost. The calculation the SLP vectorizer is making is comparing vector cost to scalar cost * vector factor. In either case the vector factor is involved, and once the vector factor is involved it is really a throughput calculation and not a latency calculation. I don't think vectorization can make sense as a latency calculation - the scalar ops are always going to be at least as cheap. The win is in doing more at the same time, not in delivering a particular result faster.

Thanks for explaining. I am not familiar enough with SLP; maybe @hfinkel can help judge whether it is reasonable to consider the number of execution units in the vectorizer's cost model. Thanks!

In D55461#1335427, @jsji wrote:

In D55461#1334814, @RolandF wrote:

The calculation the loop vectorizer is making is comparing vector cost / vector factor to scalar cost. The calculation the SLP vectorizer is making is comparing vector cost to scalar cost * vector factor. In either case the vector factor is involved, and once the vector factor is involved it is really a throughput calculation and not a latency calculation. I don't think vectorization can make sense as a latency calculation - the scalar ops are always going to be at least as cheap. The win is in doing more at the same time, not in delivering a particular result faster.

Thanks for explaining. I am not familiar enough with SLP; maybe @hfinkel can help judge whether it is reasonable to consider the number of execution units in the vectorizer's cost model. Thanks!

The vectorization cost models do measure throughput. If a single operation consumes multiple execution units, that presumably is a throughput effect. We do have a latency model, but not much uses it yet (in theory, we'd use it during vectorization to determine interleaving factors, but we don't currently do that).

Also, cost model changes can be tested directly. See tests in test/Analysis/CostModel/PowerPC/

I think this code could be greatly simplified if we implement int PPCTTIImpl::vectorCostAdjustmentFactor(Type *Ty) which would return 1 by default and 2 if we are on a subtarget with vector operations that take up two execution units and Ty is a vector type.

Also, as Hal pointed out please add cost model tests for the affected costs.

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
331 ↗	(On Diff #177338)	I don't think we want to change this. For builds with asserts compiled out, this will emit "variable defined but not used" warnings.
343 ↗	(On Diff #177338)	It seems odd to me that we want to change the cost here since a shuffle won't have a scalar equivalent. So the fact that a vector operation uses two functional units shouldn't really matter. The vectorizer won't insert a shuffle on its own. It seems to me that the added cost of being in the vector domain should already be reflected by the cost of the other operations.
410 ↗	(On Diff #177338)	I think it would be clearer if this comment appeared before the condition.

Also, I've thought a bit about what Jinsong mentioned and I am wondering if maybe we don't want to take a different approach. Namely, the cost model is meant to return the cost of individual instructions with a given type, and this cost is the instruction's latency relative to the latency of a simple scalar ALU instruction (I think a comment somewhere uses add as the basis for what 1 means). The vectorizers then use these values to essentially add up the costs of all the instructions in a loop/block and compute what provides the lowest total cost, assuming that the cost of an N-wide vectorization can be computed by simply dividing the total cost by N.

So I would argue that it is that last assumption that isn't valid on the P9 CPU (and perhaps others out there as well). I'm wondering if it wouldn't be a better approach to define something like float TargetTransformInfo::vectorUnitThroughputReductionFactor() and use that in the cost computations. On P9, that would presumably be set to 2.0 since we can dispatch half as many vector instructions in each cycle as we can scalar ones.

Let me know what you think @jsji @hfinkel @mkuper @RolandF.
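
The alternative floated in this comment can be sketched as follows. This is only an illustration under stated assumptions: `perLaneVectorCost` and its parameters are hypothetical names, and no such hook was added to LLVM by this revision.

```cpp
// Hypothetical sketch of the proposed alternative; not an LLVM interface.
// Instead of doubling each instruction's cost, the "divide by N" step is
// scaled by a per-target throughput reduction factor (1.0 by default,
// presumably 2.0 on P9 according to the comment above).
double perLaneVectorCost(double TotalVectorCost, unsigned N,
                         double ThroughputReductionFactor) {
  return TotalVectorCost * ThroughputReductionFactor / N;
}
```

For a block whose vector instructions sum to a cost of 4 at N = 2, a factor of 1.0 gives a per-lane cost of 2, and a factor of 2.0 gives 4, so the scaling lands in the same place as doubling every vector cost.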

I don't think adding a new TTI function is necessary, as I think the way I am modeling the costs here is what is expected. The following comment (from TargetTransformInfo.h: getArithmeticInstrCost()) may shed some light:

/// This is an approximation of reciprocal throughput of a math/logic op.
/// A higher cost indicates less expected throughput.
/// From Agner Fog's guides, reciprocal throughput is "the average number of
/// clock cycles per instruction when the instructions are not part of a
/// limiting dependency chain."
/// Therefore, costs should be scaled to account for multiple execution units
/// on the target that can process this type of instruction. For example, if
/// there are 5 scalar integer units and 2 vector integer units that can
/// calculate an 'add' in a single cycle, this model should indicate that the
/// cost of the vector add instruction is 2.5 times the cost of the scalar
/// add instruction.

Thanks Roland! This is a great comment from Sanjay! It is also aligned with Hal's comments.
So now we should all agree that the cost model here is about throughput.

However, according to Sanjay's example, we should account for the multiple execution units that can process this type of instruction.

So to my understanding, the cost * 2 is because we have 4 slices for scalar arithmetic but only 2 superslices for vector arithmetic, so 4/2 = 2.

But for other operations, such as shuffles and memory ops, it looks like we may need to calculate the factor one by one rather than always using 2?
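
The scaling rule in the quoted header comment, and the 4 slices versus 2 superslices reasoning that follows it, come down to a ratio of unit counts. A minimal sketch, assuming a hypothetical helper `relativeVectorCost` that is not part of LLVM:

```cpp
// Hypothetical helper: cost of a vector op relative to a scalar op of cost 1
// when both complete in a single cycle but differ in how many execution
// units can process them.
double relativeVectorCost(unsigned ScalarUnits, unsigned VectorUnits) {
  return static_cast<double>(ScalarUnits) / VectorUnits;
}
```

`relativeVectorCost(5, 2)` gives the 2.5 from the header comment, and `relativeVectorCost(4, 2)` gives the factor of 2 discussed here for P9.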

For memory ops it should be the same as arithmetic. The LSUs are a separate resource from the slices, but a vector load or store still consumes multiple LSUs (2x if aligned, 3x if not). I don't follow why there should be a problem with shuffle - I assume a shuffle will require one or more vector ALU ops.

Added a new TTI method vectorCostAdjustment to consolidate and make uniform for all instruction types the checks and cost modification. Also added direct cost model test.
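
As a reading aid, the decision logic of the consolidated method can be modelled without any LLVM types. Everything in this sketch is a hypothetical stand-in (`TypeInfo`, `adjustVectorCost`, and the boolean inputs replace the subtarget and type-legalization queries made by the real code in the diff further down).

```cpp
// Hypothetical stand-ins for the queries the real method makes.
struct TypeInfo {
  bool IsVector;           // models Ty->isVectorTy()
  int LegalizationSteps;   // models the first element of the legalization cost
  bool LegalizesToVector;  // models "the legalized type is still a vector"
};

int adjustVectorCost(int Cost, bool VectorsUseTwoUnits, const TypeInfo &Ty1,
                     bool OperationIsExpanded,
                     const TypeInfo *Ty2 = nullptr) {
  // Only vector types on subtargets with the two-unit property are affected.
  if (!VectorsUseTwoUnits || !Ty1.IsVector)
    return Cost;
  // Split or scalarized types are left alone, so the cost is not doubled
  // at every legalization step.
  if (Ty1.LegalizationSteps != 1 || !Ty1.LegalizesToVector)
    return Cost;
  // Expanded operations do not execute as a single vector instruction.
  if (OperationIsExpanded)
    return Cost;
  // The optional second type (e.g. the source of a cast) must pass the
  // same legalization check.
  if (Ty2 && (Ty2->LegalizationSteps != 1 || !Ty2->LegalizesToVector))
    return Cost;
  return Cost * 2;
}
```

Every early return hands back the unmodified base cost, so scalar code and other subtargets see no change.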

RolandF marked 3 inline comments as done. Jan 17 2019, 10:12 AM

RolandF marked an inline comment as done. Jan 17 2019, 10:22 AM

RolandF added inline comments.

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
343 ↗	(On Diff #177338)	SLP does create shuffles, and it asks about the cost of shuffles. It's really a total cost vs total cost comparison, not instruction by instruction, so I think it's appropriate to factor in the execution unit cost of additions.

Thank you for the updates. I just have a couple of minor nits to add comments which can be done on the commit. LGTM.

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
333 ↗	(On Diff #182330)	// If type legalization involves splitting the vector, we don't want to // double the cost at every step - only the last step.
426 ↗	(On Diff #182330)	Please remove the code that's commented out.

This revision is now accepted and ready to land. Jan 24 2019, 10:03 AM

nemanjai added inline comments. Jan 24 2019, 10:06 AM

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
326 ↗	(On Diff #182330)	// Adjust the cost of vector instructions on targets on which there is overlap // between the vector and scalar units, thereby reducing the overall throughput // of vector code wrt. scalar code.

Deleted commented out code and added suggested comments.

@RolandF Do you need me to commit this for you, or do you have commit access now?

@nemanjai Yes, please, I do need it committed for me. Last time I promise!

Closed by commit rL352261: [PowerPC] Update Vector Costs for P9 (authored by nemanjai). Jan 25 2019, 5:18 PM

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: llvm-commits. Jan 25 2019, 5:18 PM

Revision Contents

Path	Size
llvm/trunk/lib/Target/PowerPC/PPC.td	10 lines
llvm/trunk/lib/Target/PowerPC/PPCSubtarget.h	2 lines
llvm/trunk/lib/Target/PowerPC/PPCSubtarget.cpp	1 line
llvm/trunk/lib/Target/PowerPC/PPCTargetTransformInfo.h	1 line
llvm/trunk/lib/Target/PowerPC/PPCTargetTransformInfo.cpp	57 lines
llvm/trunk/test/Analysis/CostModel/PowerPC/p9.ll	68 lines
llvm/trunk/test/Transforms/SLPVectorizer/PowerPC/short-to-double.ll	39 lines

Diff 183667

llvm/trunk/lib/Target/PowerPC/PPC.td

@@ 184 lines not shown @@ def FeatureISA3_0 : SubtargetFeature<"isa-v30-instructions", "IsISA3_0",
                                         "Enable instructions added in ISA 3.0.">;
 def FeatureP9Altivec : SubtargetFeature<"power9-altivec", "HasP9Altivec", "true",
                                         "Enable POWER9 Altivec instructions",
                                         [FeatureISA3_0, FeatureP8Altivec]>;
 def FeatureP9Vector : SubtargetFeature<"power9-vector", "HasP9Vector", "true",
                                        "Enable POWER9 vector instructions",
                                        [FeatureISA3_0, FeatureP8Vector,
                                         FeatureP9Altivec]>;
+// A separate feature for this even though it is equivalent to P9Vector
+// because this is a feature of the implementation rather than the architecture
+// and may go away with future CPU's.
+def FeatureVectorsUseTwoUnits : SubtargetFeature<"vectors-use-two-units",
+                                                 "VectorsUseTwoUnits",
+                                                 "true",
+                                                 "Vectors use two units">;

 // Since new processors generally contain a superset of features of those that
 // came before them, the idea is to make implementations of new processors
 // less error prone and easier to read.
 // Namely:
 // list<SubtargetFeature> Power8FeatureList = ...
 // list<SubtargetFeature> FutureProcessorSpecificFeatureList =
 // [ features that Power8 does not support ]
@@ 16 lines not shown @@ list<SubtargetFeature> Power7FeatureList =
        FeatureMFTB, DeprecatedDST];
   list<SubtargetFeature> Power8SpecificFeatures =
       [DirectivePwr8, FeatureP8Altivec, FeatureP8Vector, FeatureP8Crypto,
        FeatureHTM, FeatureDirectMove, FeatureICBT, FeaturePartwordAtomic,
        FeatureFusion];
   list<SubtargetFeature> Power8FeatureList =
       !listconcat(Power7FeatureList, Power8SpecificFeatures);
   list<SubtargetFeature> Power9SpecificFeatures =
-      [DirectivePwr9, FeatureP9Altivec, FeatureP9Vector, FeatureISA3_0];
+      [DirectivePwr9, FeatureP9Altivec, FeatureP9Vector, FeatureISA3_0,
+       FeatureVectorsUseTwoUnits];
   list<SubtargetFeature> Power9FeatureList =
       !listconcat(Power8FeatureList, Power9SpecificFeatures);
 }

 // Note: Future features to add when support is extended to more
 // recent ISA levels:
 //
 // DFP p6, p6x, p7 decimal floating-point instructions
@@ 252 lines not shown @@

llvm/trunk/lib/Target/PowerPC/PPCSubtarget.h

@@ 129 lines not shown @@ protected:
   bool HasPartwordAtomics;
   bool HasDirectMove;
   bool HasHTM;
   bool HasFusion;
   bool HasFloat128;
   bool IsISA3_0;
   bool UseLongCalls;
   bool SecurePlt;
+  bool VectorsUseTwoUnits;

   POPCNTDKind HasPOPCNTD;

   /// When targeting QPX running a stock PPC64 Linux kernel where the stack
   /// alignment has not been changed, we need to keep the 16-byte alignment
   /// of the stack.
   bool IsQPXStackUnaligned;

@@ 108 lines not shown @@ public:
   bool hasExtDiv() const { return HasExtDiv; }
   bool hasCMPB() const { return HasCMPB; }
   bool hasLDBRX() const { return HasLDBRX; }
   bool isBookE() const { return IsBookE; }
   bool hasOnlyMSYNC() const { return HasOnlyMSYNC; }
   bool isPPC4xx() const { return IsPPC4xx; }
   bool isPPC6xx() const { return IsPPC6xx; }
   bool isSecurePlt() const {return SecurePlt; }
+  bool vectorsUseTwoUnits() const {return VectorsUseTwoUnits; }
   bool isE500() const { return IsE500; }
   bool isFeatureMFTB() const { return FeatureMFTB; }
   bool isDeprecatedDST() const { return DeprecatedDST; }
   bool hasICBT() const { return HasICBT; }
   bool hasInvariantFunctionDescriptors() const {
     return HasInvariantFunctionDescriptors;
   }
   bool hasPartwordAtomics() const { return HasPartwordAtomics; }
@@ 68 lines not shown @@

llvm/trunk/lib/Target/PowerPC/PPCSubtarget.cpp

@@ 101 lines not shown @@ void PPCSubtarget::initializeEnvironment() {
   HasDirectMove = false;
   IsQPXStackUnaligned = false;
   HasHTM = false;
   HasFusion = false;
   HasFloat128 = false;
   IsISA3_0 = false;
   UseLongCalls = false;
   SecurePlt = false;
+  VectorsUseTwoUnits = false;

   HasPOPCNTD = POPCNTD_Unavailable;
 }

 void PPCSubtarget::initSubtargetFeatures(StringRef CPU, StringRef FS) {
   // Determine default and user specified characteristics
   std::string CPUName = CPU;
   if (CPUName.empty() || CPU == "generic") {
@@ 113 lines not shown @@

llvm/trunk/lib/Target/PowerPC/PPCTargetTransformInfo.h

@@ 64 lines not shown @@ public:
   const TTI::MemCmpExpansionOptions *enableMemCmpExpansion(
       bool IsZeroCmp) const;
   bool enableInterleavedAccessVectorization();
   unsigned getNumberOfRegisters(bool Vector);
   unsigned getRegisterBitWidth(bool Vector) const;
   unsigned getCacheLineSize();
   unsigned getPrefetchDistance();
   unsigned getMaxInterleaveFactor(unsigned VF);
+  int vectorCostAdjustment(int Cost, unsigned Opcode, Type *Ty1, Type *Ty2);
   int getArithmeticInstrCost(
       unsigned Opcode, Type *Ty,
       TTI::OperandValueKind Opd1Info = TTI::OK_AnyValue,
       TTI::OperandValueKind Opd2Info = TTI::OK_AnyValue,
       TTI::OperandValueProperties Opd1PropInfo = TTI::OP_None,
       TTI::OperandValueProperties Opd2PropInfo = TTI::OP_None,
       ArrayRef<const Value *> Args = ArrayRef<const Value *>());
   int getShuffleCost(TTI::ShuffleKind Kind, Type *Tp, int Index, Type *SubTp);
@@ 21 lines not shown @@

llvm/trunk/lib/Target/PowerPC/PPCTargetTransformInfo.cpp

@@ 317 lines not shown @@ if (Directive == PPC::DIR_PWR7 || Directive == PPC::DIR_PWR8 ||
       Directive == PPC::DIR_PWR9)
     return 12;

   // For most things, modern systems have two execution units (and
   // out-of-order execution).
   return 2;
 }

+// Adjust the cost of vector instructions on targets which there is overlap
+// between the vector and scalar units, thereby reducing the overall throughput
+// of vector code wrt. scalar code.
+int PPCTTIImpl::vectorCostAdjustment(int Cost, unsigned Opcode, Type *Ty1,
+                                     Type *Ty2) {
+  if (!ST->vectorsUseTwoUnits() || !Ty1->isVectorTy())
+    return Cost;
+
+  std::pair<int, MVT> LT1 = TLI->getTypeLegalizationCost(DL, Ty1);
+  // If type legalization involves splitting the vector, we don't want to
+  // double the cost at every step - only the last step.
+  if (LT1.first != 1 || !LT1.second.isVector())
+    return Cost;
+  int ISD = TLI->InstructionOpcodeToISD(Opcode);
+  if (TLI->isOperationExpand(ISD, LT1.second))
+    return Cost;
+
+  if (Ty2) {
+    std::pair<int, MVT> LT2 = TLI->getTypeLegalizationCost(DL, Ty2);
+    if (LT2.first != 1 || !LT2.second.isVector())
+      return Cost;
+  }
+
+  return Cost * 2;
+}
+
 int PPCTTIImpl::getArithmeticInstrCost(
     unsigned Opcode, Type *Ty, TTI::OperandValueKind Op1Info,
     TTI::OperandValueKind Op2Info, TTI::OperandValueProperties Opd1PropInfo,
     TTI::OperandValueProperties Opd2PropInfo, ArrayRef<const Value *> Args) {
   assert(TLI->InstructionOpcodeToISD(Opcode) && "Invalid opcode");

   // Fallback to the default implementation.
-  return BaseT::getArithmeticInstrCost(Opcode, Ty, Op1Info, Op2Info,
-                                       Opd1PropInfo, Opd2PropInfo);
+  int Cost = BaseT::getArithmeticInstrCost(Opcode, Ty, Op1Info, Op2Info,
+                                           Opd1PropInfo, Opd2PropInfo);
+  return vectorCostAdjustment(Cost, Opcode, Ty, nullptr);
 }

 int PPCTTIImpl::getShuffleCost(TTI::ShuffleKind Kind, Type *Tp, int Index,
                                Type *SubTp) {
   // Legalize the type.
   std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, Tp);

   // PPC, for both Altivec/VSX and QPX, support cheap arbitrary permutations
   // (at least in the sense that there need only be one non-loop-invariant
   // instruction). We need one such shuffle instruction for each actual
   // register (this is not true for arbitrary shuffles, but is true for the
   // structured types of shuffles covered by TTI::ShuffleKind).
-  return LT.first;
+  return vectorCostAdjustment(LT.first, Instruction::ShuffleVector, Tp,
+                              nullptr);
 }

 int PPCTTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src,
                                  const Instruction *I) {
   assert(TLI->InstructionOpcodeToISD(Opcode) && "Invalid opcode");

-  return BaseT::getCastInstrCost(Opcode, Dst, Src);
+  int Cost = BaseT::getCastInstrCost(Opcode, Dst, Src);
+  return vectorCostAdjustment(Cost, Opcode, Dst, Src);
 }

 int PPCTTIImpl::getCmpSelInstrCost(unsigned Opcode, Type *ValTy, Type *CondTy,
                                    const Instruction *I) {
-  return BaseT::getCmpSelInstrCost(Opcode, ValTy, CondTy, I);
+  int Cost = BaseT::getCmpSelInstrCost(Opcode, ValTy, CondTy, I);
+  return vectorCostAdjustment(Cost, Opcode, ValTy, nullptr);
 }

 int PPCTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val, unsigned Index) {
   assert(Val->isVectorTy() && "This must be a vector type");

   int ISD = TLI->InstructionOpcodeToISD(Opcode);
   assert(ISD && "Invalid opcode");

+  int Cost = BaseT::getVectorInstrCost(Opcode, Val, Index);
+  Cost = vectorCostAdjustment(Cost, Opcode, Val, nullptr);
+
   if (ST->hasVSX() && Val->getScalarType()->isDoubleTy()) {
-    // Double-precision scalars are already located in index #0.
-    if (Index == 0)
+    // Double-precision scalars are already located in index #0 (or #1 if LE).
+    if (ISD == ISD::EXTRACT_VECTOR_ELT && Index == ST->isLittleEndian() ? 1 : 0)
       return 0;

-    return BaseT::getVectorInstrCost(Opcode, Val, Index);
+    return Cost;

   } else if (ST->hasQPX() && Val->getScalarType()->isFloatingPointTy()) {
     // Floating point scalars are already located in index #0.
     if (Index == 0)
       return 0;

-    return BaseT::getVectorInstrCost(Opcode, Val, Index);
+    return Cost;
   }

   // Estimated cost of a load-hit-store delay. This was obtained
   // experimentally as a minimum needed to prevent unprofitable
   // vectorization for the paq8p benchmark. It may need to be
   // raised further if other unprofitable cases remain.
   unsigned LHSPenalty = 2;
   if (ISD == ISD::INSERT_VECTOR_ELT)
     LHSPenalty += 7;

   // Vector element insert/extract with Altivec is very expensive,
   // because they require store and reload with the attendant
   // processor stall for load-hit-store. Until VSX is available,
   // these need to be estimated as very costly.
   if (ISD == ISD::EXTRACT_VECTOR_ELT ||
       ISD == ISD::INSERT_VECTOR_ELT)
-    return LHSPenalty + BaseT::getVectorInstrCost(Opcode, Val, Index);
+    return LHSPenalty + Cost;

-  return BaseT::getVectorInstrCost(Opcode, Val, Index);
+  return Cost;
 }

 int PPCTTIImpl::getMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,
                                 unsigned AddressSpace, const Instruction *I) {
   // Legalize the type.
   std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, Src);
   assert((Opcode == Instruction::Load || Opcode == Instruction::Store) &&
          "Invalid Opcode");

   int Cost = BaseT::getMemoryOpCost(Opcode, Src, Alignment, AddressSpace);
+  Cost = vectorCostAdjustment(Cost, Opcode, Src, nullptr);

   bool IsAltivecType = ST->hasAltivec() &&
                        (LT.second == MVT::v16i8 || LT.second == MVT::v8i16 ||
                         LT.second == MVT::v4i32 || LT.second == MVT::v4f32);
   bool IsVSXType = ST->hasVSX() &&
                    (LT.second == MVT::v2f64 || LT.second == MVT::v2i64);
   bool IsQPXType = ST->hasQPX() &&
                    (LT.second == MVT::v4f64 || LT.second == MVT::v4f32);
@@ 85 lines not shown @@

llvm/trunk/test/Analysis/CostModel/PowerPC/p9.ll

				; RUN: opt < %s -cost-model -analyze -mtriple=powerpc64-unknown-linux-gnu -mcpu=pwr7 -mattr=+vsx | FileCheck %s
				; RUN: opt < %s -cost-model -analyze -mtriple=powerpc64-unknown-linux-gnu -mcpu=pwr9 -mattr=+vsx | FileCheck --check-prefix=CHECK-P9 %s
				; RUN: opt < %s -cost-model -analyze -mtriple=powerpc64le-unknown-linux-gnu -mcpu=pwr9 -mattr=+vsx | FileCheck --check-prefix=CHECK-LE %s

				define void @testi16(i16 %arg1, i16 %arg2, i16* %arg3) {

				%s1 = add i16 %arg1, %arg2
				%s2 = zext i16 %arg1 to i32
				%s3 = load i16, i16* %arg3
				store i16 %arg2, i16* %arg3
				%c = icmp eq i16 %arg1, %arg2

				ret void
				; CHECK: cost of 1 {{.*}} add
				; CHECK: cost of 1 {{.*}} zext
				; CHECK: cost of 1 {{.*}} load
				; CHECK: cost of 1 {{.*}} store
				; CHECK: cost of 1 {{.*}} icmp
				; CHECK-P9: cost of 1 {{.*}} add
				; CHECK-P9: cost of 1 {{.*}} zext
				; CHECK-P9: cost of 1 {{.*}} load
				; CHECK-P9: cost of 1 {{.*}} store
				; CHECK-P9: cost of 1 {{.*}} icmp
				}

				define void @test4xi16(<4 x i16> %arg1, <4 x i16> %arg2) {

				%v1 = add <4 x i16> %arg1, %arg2
				%v2 = zext <4 x i16> %arg1 to <4 x i32>
				%v3 = shufflevector <4 x i16> %arg1, <4 x i16> undef, <4 x i32> zeroinitializer
				%c = icmp eq <4 x i16> %arg1, %arg2

				ret void
				; CHECK: cost of 1 {{.*}} add
				; CHECK: cost of 1 {{.*}} zext
				; CHECK: cost of 1 {{.*}} shufflevector
				; CHECK: cost of 1 {{.*}} icmp
				; CHECK-P9: cost of 2 {{.*}} add
				; CHECK-P9: cost of 2 {{.*}} zext
				; CHECK-P9: cost of 2 {{.*}} shufflevector
				; CHECK-P9: cost of 2 {{.*}} icmp
				}

				define void @test4xi32(<4 x i32> %arg1, <4 x i32> %arg2, <4 x i32>* %arg3) {

				%v1 = load <4 x i32>, <4 x i32>* %arg3
				store <4 x i32> %arg2, <4 x i32>* %arg3

				ret void
				; CHECK: cost of 1 {{.*}} load
				; CHECK: cost of 1 {{.*}} store
				; CHECK-P9: cost of 2 {{.*}} load
				; CHECK-P9: cost of 2 {{.*}} store
				}

				define void @test2xdouble(<2 x double> %arg1) {
				%v1 = extractelement <2 x double> %arg1, i32 0
				%v2 = extractelement <2 x double> %arg1, i32 1

				ret void
				; CHECK: cost of 0 {{.*}} extractelement
				; CHECK: cost of 1 {{.*}} extractelement
				; CHECK-P9: cost of 0 {{.*}} extractelement
				; CHECK-P9: cost of 2 {{.*}} extractelement
				; CHECK-LE-LABEL: test2xdouble
				; CHECK-LE: cost of 2 {{.*}} extractelement
				; CHECK-LE: cost of 0 {{.*}} extractelement
				}

llvm/trunk/test/Transforms/SLPVectorizer/PowerPC/short-to-double.ll

				; RUN: opt -S -mtriple=powerpc64-linux-gnu -mcpu=pwr9 -mattr=+vsx -slp-vectorizer < %s | FileCheck %s --check-prefix=CHECK-P9
				; RUN: opt -S -mtriple=powerpc64-linux-gnu -mcpu=pwr8 -mattr=+vsx -slp-vectorizer < %s | FileCheck %s --check-prefix=CHECK-P8

				%struct._pp = type { i16, i16, i16, i16 }

				; Function Attrs: norecurse nounwind readonly
				define [5 x double] @foo(double %k, i64 %n, %struct._pp* nocapture readonly %p) local_unnamed_addr #0 {
				entry:
				%cmp17 = icmp sgt i64 %n, 0
				br i1 %cmp17, label %for.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.body, %entry
				%retval.sroa.0.0.lcssa = phi double [ 0.000000e+00, %entry ], [ %add, %for.body ]
				%retval.sroa.4.0.lcssa = phi double [ 0.000000e+00, %entry ], [ %add10, %for.body ]
				%.fca.0.insert = insertvalue [5 x double] undef, double %retval.sroa.0.0.lcssa, 0
				%.fca.1.insert = insertvalue [5 x double] %.fca.0.insert, double %retval.sroa.4.0.lcssa, 1
				ret [5 x double] %.fca.1.insert

				for.body: ; preds = %entry, %for.body
				%i.020 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
				%retval.sroa.4.019 = phi double [ %add10, %for.body ], [ 0.000000e+00, %entry ]
				%retval.sroa.0.018 = phi double [ %add, %for.body ], [ 0.000000e+00, %entry ]
				%r1 = getelementptr inbounds %struct._pp, %struct._pp* %p, i64 %i.020, i32 2
				%0 = load i16, i16* %r1, align 2
				%conv2 = uitofp i16 %0 to double
				%mul = fmul double %conv2, %k
				%add = fadd double %retval.sroa.0.018, %mul
				%g5 = getelementptr inbounds %struct._pp, %struct._pp* %p, i64 %i.020, i32 1
				%1 = load i16, i16* %g5, align 2
				%conv7 = uitofp i16 %1 to double
				%mul8 = fmul double %conv7, %k
				%add10 = fadd double %retval.sroa.4.019, %mul8
				%inc = add nuw nsw i64 %i.020, 1
				%exitcond = icmp eq i64 %inc, %n
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				; CHECK-P8: load <2 x i16>
				; CHECK-P9-NOT: load <2 x i16>
