This is an archive of the discontinued LLVM Phabricator instance.


Differential D113900

[PowerPC] Prevent the optimizer from producing wide vector types in IR.
ClosedPublic

Authored by amyk on Nov 15 2021, 7:34 AM.


Details

Reviewers

nemanjai
lei

Group Reviewers

Restricted Project

Commits

rG150681f2f322: [PowerPC] Prevent the optimizer from producing wide vector types in IR.

Summary

This patch prevents the optimizer from producing wide vectors in the IR, specifically the
MMA types (v256i1, v512i1). The idea is that on Power, we want to produce these types
only if the vector_pair and vector_quad types are used explicitly.

To prevent the optimizer from producing these types in the IR, vectorCostAdjustmentFactor()
is updated to return either an instruction cost factor or, if the current type is an MMA type,
an invalid instruction cost. An invalid instruction cost returned by this function signals the other
cost-computing functions to return the maximum instruction cost, which tells the optimizer that
producing these types within the IR is expensive and should be avoided in the first place.
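
As a rough illustration of which types the summary refers to, the sketch below mirrors the shape of the isMMAType() predicate from the diff on a simplified stand-in type. SimpleType, isMMALikeType, userCost, and MaxCost are hypothetical names used only for this note; they are not LLVM APIs, and the per-register base cost is made up.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical, simplified stand-in for an IR type; this is not llvm::Type.
struct SimpleType {
  bool IsVector;
  unsigned ElementBits; // width of one element in bits
  unsigned NumElements; // 1 for scalars

  unsigned totalBits() const { return ElementBits * NumElements; }
};

// Same shape as the predicate in the diff: a vector of 1-bit elements that
// is wider than 128 bits, so <256 x i1> and <512 x i1> both match.
bool isMMALikeType(const SimpleType &Ty) {
  return Ty.IsVector && Ty.ElementBits == 1 && Ty.totalBits() > 128;
}

// Stand-in for "the maximum instruction cost".
constexpr std::int64_t MaxCost = std::numeric_limits<std::int64_t>::max();

// A cost query that reports the maximum for MMA-like types. The fallback
// (one unit per 128-bit register) is an assumption made for this sketch.
std::int64_t userCost(const SimpleType &Ty) {
  if (isMMALikeType(Ty))
    return MaxCost;
  return (Ty.totalBits() + 127) / 128;
}
```

With these definitions, userCost({true, 1, 256}) yields MaxCost, while a stand-in for <8 x i32> ({true, 32, 8}) costs 2.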

This issue was first seen in the test case included within this patch.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

amyk created this revision. Nov 15 2021, 7:34 AM

Herald added subscribers: wenlei, shchenz, kbarton, hiraditya. Nov 15 2021, 7:34 AM

amyk requested review of this revision. Nov 15 2021, 7:34 AM

Harbormaster completed remote builds in B134255: Diff 387007. Nov 15 2021, 8:15 AM

nemanjai added inline comments. Nov 15 2021, 9:24 AM

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
1010	Can you please explain why we don't want to do this check in the above condition?
1097	If we are exiting early, might as well do it before we do the ISD/Cost computations above.

Instead of adding:

if (isMMAType(Src))
    return InstructionCost::getMax();

to all the different cost-calculating functions, is it possible to just add it to the underlying functions?

For example, I see this check was added to multiple functions that also call vectorCostAdjustment() or getMemoryOpCost().
I haven't tried this, so I'm not sure it will work, but I think it would make sense to at least add this to vectorCostAdjustment(),
since it's a vector type.

Discussed refactoring this approach with Nemanja and Lei outside of this review.
The suggestion was to instead transform the vectorCostAdjustment() function so that it returns a cost adjustment
factor to multiply by, rather than the newly computed cost.

This lets us place the check for the MMA types within vectorCostAdjustment() and call the function
first in all of the instruction cost functions, which also allows an early exit if we find an MMA type.
Moreover, since many functions rely on vectorCostAdjustment() for cost computation anyway, it reduces the
chances of missing a place where the maximum instruction cost should be returned for an MMA type.
Essentially, we want to guarantee that every instruction cost function can return the maximum instruction cost for
the MMA types, and this change helps with that goal.
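
The refactoring described above can be modelled in isolation. This is a minimal sketch, assuming std::optional as a stand-in for "no usable factor"; VecType, costAdjustmentFactor, baseCost, arithmeticCost, BaseCostQueries, and MaxCost are hypothetical names for this note, not the functions in PPCTargetTransformInfo.cpp.

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical stand-in for a vector type.
struct VecType {
  unsigned ElementBits;
  unsigned NumElements;
};

constexpr std::int64_t MaxCost = std::numeric_limits<std::int64_t>::max();

// Counts how often the (pretend) base cost is computed, to make the early
// exit observable.
int BaseCostQueries = 0;

std::int64_t baseCost(const VecType &) {
  ++BaseCostQueries;
  return 1;
}

// Returns a multiplicative factor instead of an adjusted cost. An empty
// optional marks an MMA-like type (1-bit elements, wider than 128 bits).
std::optional<int> costAdjustmentFactor(const VecType &Ty,
                                        bool VectorsUseTwoUnits) {
  if (Ty.ElementBits == 1 && Ty.ElementBits * Ty.NumElements > 128)
    return std::nullopt;
  return VectorsUseTwoUnits ? 2 : 1;
}

// Because the factor does not depend on the base cost, it can be queried
// first, and the function can bail out before doing any other work.
std::int64_t arithmeticCost(const VecType &Ty, bool VectorsUseTwoUnits) {
  std::optional<int> Factor = costAdjustmentFactor(Ty, VectorsUseTwoUnits);
  if (!Factor)
    return MaxCost;
  return baseCost(Ty) * *Factor;
}
```

Here arithmeticCost({32, 4}, true) returns 2, and arithmeticCost({1, 256}, true) returns MaxCost without incrementing BaseCostQueries. Every cost function that starts with the factor query gets the MMA check as a side effect, which is the point of the suggestion.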

Harbormaster completed remote builds in B135268: Diff 388704. Nov 20 2021, 9:08 AM

LGTM aside from a name and return type change.

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
955–956	This should be returning `unsigned`. And the name should change to indicate that this is a factor.

This revision is now accepted and ready to land. Nov 24 2021, 10:10 AM

LGTM other than the comment about the unnecessary code block.

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
1038–1043	I don't think this is needed....

Update the patch to utilize an invalid InstructionCost, and return InstructionCost in vectorCostAdjustment. This was discussed with Nemanja and Lei outside of the Phabricator review.

LGTM. Thanks for the update.

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
958	This comment is now out of date. Please update it on the commit.

amyk edited the summary of this revision. Nov 25 2021, 9:14 AM

amyk added inline comments.

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
958	Thanks Nemanja for the review. I'll update it on the commit.

Harbormaster completed remote builds in B136085: Diff 389815. Nov 25 2021, 9:34 AM

Closed by commit rG150681f2f322: [PowerPC] Prevent the optimizer from producing wide vector types in IR. (authored by amyk). Nov 25 2021, 10:36 AM

This revision was automatically updated to reflect the committed changes.

amyk added a commit: rG150681f2f322: [PowerPC] Prevent the optimizer from producing wide vector types in IR.

I happened to read the internal issue this patch is trying to resolve. My question may be outside the scope of this patch: is the LLVM back-end supposed to generate proper code for any legal LLVM IR? I still see llc fail to compile the attached LLVM IR with -mcpu=pwr10, but succeed with -mcpu=pwr9.

v256.ll (16 KB)

Given that new types will keep being added to MVT, is the back-end for each target responsible for handling all newly added MVTs (setting actions to custom, expand, promote, etc.)?

fhahn added a subscriber: fhahn. Jan 14 2022, 4:05 AM

fhahn added inline comments.

llvm/test/Transforms/PGOProfile/ppc-prevent-mma-types.ll
4	Is there a reason this test checks the full pipeline compared to just the SLP vectorizer pass? The changes look completely unrelated to PGO/profiling.

amyk mentioned this in D118142: [PowerPC] Update ppc-prevent-mma-types.ll with custom opt pipeline. Jan 25 2022, 6:45 AM

amyk added inline comments.

llvm/test/Transforms/PGOProfile/ppc-prevent-mma-types.ll
4	Thanks for taking a look at this. I originally found this test case under the full PGO pipeline. Then, I realized the SLP vectorizer was also an important culprit in reproducing the issue, but it appears to be a bit more complex than that. It seems like there are other passes that come into play when trying to reproduce this issue. I've attempted to find a minimal set of passes and I've posted a patch to hopefully be a bit more specific with the pass pipeline: https://reviews.llvm.org/D118142 I'd appreciate your opinion on the updated pass pipeline.

Revision Contents

Path	Size
llvm/lib/Target/PowerPC/PPCTargetTransformInfo.h	4 lines
llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp	91 lines
llvm/test/Transforms/PGOProfile/ppc-prevent-mma-types.ll	204 lines

Diff 389841

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.h

[94 lines not shown]	public:
};		};
unsigned getNumberOfRegisters(unsigned ClassID) const;		unsigned getNumberOfRegisters(unsigned ClassID) const;
unsigned getRegisterClassForType(bool Vector, Type *Ty = nullptr) const;		unsigned getRegisterClassForType(bool Vector, Type *Ty = nullptr) const;
const char* getRegisterClassName(unsigned ClassID) const;		const char* getRegisterClassName(unsigned ClassID) const;
TypeSize getRegisterBitWidth(TargetTransformInfo::RegisterKind K) const;		TypeSize getRegisterBitWidth(TargetTransformInfo::RegisterKind K) const;
unsigned getCacheLineSize() const override;		unsigned getCacheLineSize() const override;
unsigned getPrefetchDistance() const override;		unsigned getPrefetchDistance() const override;
unsigned getMaxInterleaveFactor(unsigned VF);		unsigned getMaxInterleaveFactor(unsigned VF);
InstructionCost vectorCostAdjustment(InstructionCost Cost, unsigned Opcode,		InstructionCost vectorCostAdjustmentFactor(unsigned Opcode, Type *Ty1,
Type *Ty1, Type *Ty2);		Type *Ty2);
InstructionCost getArithmeticInstrCost(		InstructionCost getArithmeticInstrCost(
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,		unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueKind Opd1Info = TTI::OK_AnyValue,		TTI::OperandValueKind Opd1Info = TTI::OK_AnyValue,
TTI::OperandValueKind Opd2Info = TTI::OK_AnyValue,		TTI::OperandValueKind Opd2Info = TTI::OK_AnyValue,
TTI::OperandValueProperties Opd1PropInfo = TTI::OP_None,		TTI::OperandValueProperties Opd1PropInfo = TTI::OP_None,
TTI::OperandValueProperties Opd2PropInfo = TTI::OP_None,		TTI::OperandValueProperties Opd2PropInfo = TTI::OP_None,
ArrayRef<const Value *> Args = ArrayRef<const Value *>(),		ArrayRef<const Value *> Args = ArrayRef<const Value *>(),
const Instruction *CxtI = nullptr);		const Instruction *CxtI = nullptr);
[33 lines not shown]

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp

[312 lines not shown]	if (Idx == ImmIdx && Imm.getBitWidth() <= 64) {

if (ShiftedFree && (Imm.getZExtValue() & 0xFFFF) == 0)		if (ShiftedFree && (Imm.getZExtValue() & 0xFFFF) == 0)
return TTI::TCC_Free;		return TTI::TCC_Free;
}		}

return PPCTTIImpl::getIntImmCost(Imm, Ty, CostKind);		return PPCTTIImpl::getIntImmCost(Imm, Ty, CostKind);
}		}

		// Check if the current Type is an MMA vector type. Valid MMA types are
		// v256i1 and v512i1 respectively.
		static bool isMMAType(Type *Ty) {
		return Ty->isVectorTy() && (Ty->getScalarSizeInBits() == 1) &&
		(Ty->getPrimitiveSizeInBits() > 128);
		}

InstructionCost PPCTTIImpl::getUserCost(const User *U,		InstructionCost PPCTTIImpl::getUserCost(const User *U,
ArrayRef<const Value *> Operands,		ArrayRef<const Value *> Operands,
TTI::TargetCostKind CostKind) {		TTI::TargetCostKind CostKind) {
		// Set the max cost if an MMA type is present (v256i1, v512i1).
		if (isMMAType(U->getType()))
		return InstructionCost::getMax();

// We already implement getCastInstrCost and getMemoryOpCost where we perform		// We already implement getCastInstrCost and getMemoryOpCost where we perform
// the vector adjustment there.		// the vector adjustment there.
if (isa<CastInst>(U) || isa<LoadInst>(U) || isa<StoreInst>(U))		if (isa<CastInst>(U) || isa<LoadInst>(U) || isa<StoreInst>(U))
return BaseT::getUserCost(U, Operands, CostKind);		return BaseT::getUserCost(U, Operands, CostKind);

if (U->getType()->isVectorTy()) {		if (U->getType()->isVectorTy()) {
// Instructions that need to be split should cost more.		// Instructions that need to be split should cost more.
std::pair<InstructionCost, MVT> LT =		std::pair<InstructionCost, MVT> LT =
[604 lines not shown]	if (Directive == PPC::DIR_PWR7 || Directive == PPC::DIR_PWR8 ||
Directive == PPC::DIR_PWR9 || Directive == PPC::DIR_PWR10 ||		Directive == PPC::DIR_PWR9 || Directive == PPC::DIR_PWR10 ||
Directive == PPC::DIR_PWR_FUTURE)		Directive == PPC::DIR_PWR_FUTURE)
return 12;		return 12;

// For most things, modern systems have two execution units (and		// For most things, modern systems have two execution units (and
// out-of-order execution).		// out-of-order execution).
return 2;		return 2;
}		}

// Adjust the cost of vector instructions on targets which there is overlap		// Returns a cost adjustment factor to adjust the cost of vector instructions
		[Inline comment] nemanjai: This should be returning `unsigned`. And the name should change to indicate that this is a factor.
// between the vector and scalar units, thereby reducing the overall throughput		// on targets which there is overlap between the vector and scalar units,
// of vector code wrt. scalar code.		// thereby reducing the overall throughput of vector code wrt. scalar code.
		[Inline comment] nemanjai: This comment is now out of date. Please update it on the commit.
		[Inline comment] amyk: Thanks Nemanja for the review. I'll update it on the commit.
InstructionCost PPCTTIImpl::vectorCostAdjustment(InstructionCost Cost,		// An invalid instruction cost is returned if the type is an MMA vector type.
unsigned Opcode, Type *Ty1,		InstructionCost PPCTTIImpl::vectorCostAdjustmentFactor(unsigned Opcode,
Type *Ty2) {		Type *Ty1, Type *Ty2) {
		// If the vector type is of an MMA type (v256i1, v512i1), an invalid
		// instruction cost is returned. This is to signify to other cost computing
		// functions to return the maximum instruction cost in order to prevent any
		// opportunities for the optimizer to produce MMA types within the IR.
		if (isMMAType(Ty1))
		return InstructionCost::getInvalid();

if (!ST->vectorsUseTwoUnits() || !Ty1->isVectorTy())		if (!ST->vectorsUseTwoUnits() || !Ty1->isVectorTy())
return Cost;		return InstructionCost(1);

std::pair<InstructionCost, MVT> LT1 = TLI->getTypeLegalizationCost(DL, Ty1);		std::pair<InstructionCost, MVT> LT1 = TLI->getTypeLegalizationCost(DL, Ty1);
// If type legalization involves splitting the vector, we don't want to		// If type legalization involves splitting the vector, we don't want to
// double the cost at every step - only the last step.		// double the cost at every step - only the last step.
if (LT1.first != 1 || !LT1.second.isVector())		if (LT1.first != 1 || !LT1.second.isVector())
return Cost;		return InstructionCost(1);

int ISD = TLI->InstructionOpcodeToISD(Opcode);		int ISD = TLI->InstructionOpcodeToISD(Opcode);
if (TLI->isOperationExpand(ISD, LT1.second))		if (TLI->isOperationExpand(ISD, LT1.second))
return Cost;		return InstructionCost(1);

if (Ty2) {		if (Ty2) {
std::pair<InstructionCost, MVT> LT2 = TLI->getTypeLegalizationCost(DL, Ty2);		std::pair<InstructionCost, MVT> LT2 = TLI->getTypeLegalizationCost(DL, Ty2);
if (LT2.first != 1 || !LT2.second.isVector())		if (LT2.first != 1 || !LT2.second.isVector())
return Cost;		return InstructionCost(1);
}		}

return Cost * 2;		return InstructionCost(2);
}		}

InstructionCost PPCTTIImpl::getArithmeticInstrCost(		InstructionCost PPCTTIImpl::getArithmeticInstrCost(
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,		unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
TTI::OperandValueKind Op1Info, TTI::OperandValueKind Op2Info,		TTI::OperandValueKind Op1Info, TTI::OperandValueKind Op2Info,
TTI::OperandValueProperties Opd1PropInfo,		TTI::OperandValueProperties Opd1PropInfo,
TTI::OperandValueProperties Opd2PropInfo, ArrayRef<const Value *> Args,		TTI::OperandValueProperties Opd2PropInfo, ArrayRef<const Value *> Args,
const Instruction *CxtI) {		const Instruction *CxtI) {
assert(TLI->InstructionOpcodeToISD(Opcode) && "Invalid opcode");		assert(TLI->InstructionOpcodeToISD(Opcode) && "Invalid opcode");

		InstructionCost CostFactor = vectorCostAdjustmentFactor(Opcode, Ty, nullptr);
		if (!CostFactor.isValid())
		return InstructionCost::getMax();

// TODO: Handle more cost kinds.		// TODO: Handle more cost kinds.
if (CostKind != TTI::TCK_RecipThroughput)		if (CostKind != TTI::TCK_RecipThroughput)
return BaseT::getArithmeticInstrCost(Opcode, Ty, CostKind, Op1Info,		return BaseT::getArithmeticInstrCost(Opcode, Ty, CostKind, Op1Info,
Op2Info, Opd1PropInfo,		Op2Info, Opd1PropInfo,
Opd2PropInfo, Args, CxtI);		Opd2PropInfo, Args, CxtI);

// Fallback to the default implementation.		// Fallback to the default implementation.
InstructionCost Cost = BaseT::getArithmeticInstrCost(		InstructionCost Cost = BaseT::getArithmeticInstrCost(
		[Inline comment] nemanjai: Can you please explain why we don't want to do this check in the above condition?
Opcode, Ty, CostKind, Op1Info, Op2Info, Opd1PropInfo, Opd2PropInfo);		Opcode, Ty, CostKind, Op1Info, Op2Info, Opd1PropInfo, Opd2PropInfo);
return vectorCostAdjustment(Cost, Opcode, Ty, nullptr);		return Cost * CostFactor;
}		}

InstructionCost PPCTTIImpl::getShuffleCost(TTI::ShuffleKind Kind, Type *Tp,		InstructionCost PPCTTIImpl::getShuffleCost(TTI::ShuffleKind Kind, Type *Tp,
ArrayRef<int> Mask, int Index,		ArrayRef<int> Mask, int Index,
Type *SubTp) {		Type *SubTp) {

		InstructionCost CostFactor =
		vectorCostAdjustmentFactor(Instruction::ShuffleVector, Tp, nullptr);
		if (!CostFactor.isValid())
		return InstructionCost::getMax();

// Legalize the type.		// Legalize the type.
std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, Tp);		std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, Tp);

// PPC, for both Altivec/VSX, support cheap arbitrary permutations		// PPC, for both Altivec/VSX, support cheap arbitrary permutations
// (at least in the sense that there need only be one non-loop-invariant		// (at least in the sense that there need only be one non-loop-invariant
// instruction). We need one such shuffle instruction for each actual		// instruction). We need one such shuffle instruction for each actual
// register (this is not true for arbitrary shuffles, but is true for the		// register (this is not true for arbitrary shuffles, but is true for the
// structured types of shuffles covered by TTI::ShuffleKind).		// structured types of shuffles covered by TTI::ShuffleKind).
return vectorCostAdjustment(LT.first, Instruction::ShuffleVector, Tp,		return LT.first * CostFactor;
nullptr);
}		}

InstructionCost PPCTTIImpl::getCFInstrCost(unsigned Opcode,		InstructionCost PPCTTIImpl::getCFInstrCost(unsigned Opcode,
TTI::TargetCostKind CostKind,		TTI::TargetCostKind CostKind,
const Instruction *I) {		const Instruction *I) {
if (CostKind != TTI::TCK_RecipThroughput)		if (CostKind != TTI::TCK_RecipThroughput)
return Opcode == Instruction::PHI ? 0 : 1;		return Opcode == Instruction::PHI ? 0 : 1;
// Branches are assumed to be predicted.		// Branches are assumed to be predicted.
return 0;		return 0;
}		}

		[Inline comment] lei: I don't think this is needed....
InstructionCost PPCTTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,		InstructionCost PPCTTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
Type *Src,		Type *Src,
TTI::CastContextHint CCH,		TTI::CastContextHint CCH,
TTI::TargetCostKind CostKind,		TTI::TargetCostKind CostKind,
const Instruction *I) {		const Instruction *I) {
assert(TLI->InstructionOpcodeToISD(Opcode) && "Invalid opcode");		assert(TLI->InstructionOpcodeToISD(Opcode) && "Invalid opcode");

		InstructionCost CostFactor = vectorCostAdjustmentFactor(Opcode, Dst, Src);
		if (!CostFactor.isValid())
		return InstructionCost::getMax();

InstructionCost Cost =		InstructionCost Cost =
BaseT::getCastInstrCost(Opcode, Dst, Src, CCH, CostKind, I);		BaseT::getCastInstrCost(Opcode, Dst, Src, CCH, CostKind, I);
Cost = vectorCostAdjustment(Cost, Opcode, Dst, Src);		Cost *= CostFactor;
// TODO: Allow non-throughput costs that aren't binary.		// TODO: Allow non-throughput costs that aren't binary.
if (CostKind != TTI::TCK_RecipThroughput)		if (CostKind != TTI::TCK_RecipThroughput)
return Cost == 0 ? 0 : 1;		return Cost == 0 ? 0 : 1;
return Cost;		return Cost;
}		}

InstructionCost PPCTTIImpl::getCmpSelInstrCost(unsigned Opcode, Type *ValTy,		InstructionCost PPCTTIImpl::getCmpSelInstrCost(unsigned Opcode, Type *ValTy,
Type *CondTy,		Type *CondTy,
CmpInst::Predicate VecPred,		CmpInst::Predicate VecPred,
TTI::TargetCostKind CostKind,		TTI::TargetCostKind CostKind,
const Instruction *I) {		const Instruction *I) {
		InstructionCost CostFactor =
		vectorCostAdjustmentFactor(Opcode, ValTy, nullptr);
		if (!CostFactor.isValid())
		return InstructionCost::getMax();

InstructionCost Cost =		InstructionCost Cost =
BaseT::getCmpSelInstrCost(Opcode, ValTy, CondTy, VecPred, CostKind, I);		BaseT::getCmpSelInstrCost(Opcode, ValTy, CondTy, VecPred, CostKind, I);
// TODO: Handle other cost kinds.		// TODO: Handle other cost kinds.
if (CostKind != TTI::TCK_RecipThroughput)		if (CostKind != TTI::TCK_RecipThroughput)
return Cost;		return Cost;
return vectorCostAdjustment(Cost, Opcode, ValTy, nullptr);		return Cost * CostFactor;
}		}

InstructionCost PPCTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,		InstructionCost PPCTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
unsigned Index) {		unsigned Index) {
assert(Val->isVectorTy() && "This must be a vector type");		assert(Val->isVectorTy() && "This must be a vector type");

int ISD = TLI->InstructionOpcodeToISD(Opcode);		int ISD = TLI->InstructionOpcodeToISD(Opcode);
assert(ISD && "Invalid opcode");		assert(ISD && "Invalid opcode");

		InstructionCost CostFactor = vectorCostAdjustmentFactor(Opcode, Val, nullptr);
		if (!CostFactor.isValid())
		return InstructionCost::getMax();

InstructionCost Cost = BaseT::getVectorInstrCost(Opcode, Val, Index);		InstructionCost Cost = BaseT::getVectorInstrCost(Opcode, Val, Index);
Cost = vectorCostAdjustment(Cost, Opcode, Val, nullptr);		Cost *= CostFactor;

if (ST->hasVSX() && Val->getScalarType()->isDoubleTy()) {		if (ST->hasVSX() && Val->getScalarType()->isDoubleTy()) {
// Double-precision scalars are already located in index #0 (or #1 if LE).		// Double-precision scalars are already located in index #0 (or #1 if LE).
		[Inline comment] nemanjai: If we are exiting early, might as well do it before we do the ISD/Cost computations above.
if (ISD == ISD::EXTRACT_VECTOR_ELT &&		if (ISD == ISD::EXTRACT_VECTOR_ELT &&
Index == (ST->isLittleEndian() ? 1 : 0))		Index == (ST->isLittleEndian() ? 1 : 0))
return 0;		return 0;

return Cost;		return Cost;

} else if (Val->getScalarType()->isIntegerTy() && Index != -1U) {		} else if (Val->getScalarType()->isIntegerTy() && Index != -1U) {
if (ST->hasP9Altivec()) {		if (ST->hasP9Altivec()) {
if (ISD == ISD::INSERT_VECTOR_ELT)		if (ISD == ISD::INSERT_VECTOR_ELT)
// A move-to VSR and a permute/insert. Assume vector operation cost		// A move-to VSR and a permute/insert. Assume vector operation cost
// for both (cost will be 2x on P9).		// for both (cost will be 2x on P9).
return vectorCostAdjustment(2, Opcode, Val, nullptr);		return 2 * CostFactor;

// It's an extract. Maybe we can do a cheap move-from VSR.		// It's an extract. Maybe we can do a cheap move-from VSR.
unsigned EltSize = Val->getScalarSizeInBits();		unsigned EltSize = Val->getScalarSizeInBits();
if (EltSize == 64) {		if (EltSize == 64) {
unsigned MfvsrdIndex = ST->isLittleEndian() ? 1 : 0;		unsigned MfvsrdIndex = ST->isLittleEndian() ? 1 : 0;
if (Index == MfvsrdIndex)		if (Index == MfvsrdIndex)
return 1;		return 1;
} else if (EltSize == 32) {		} else if (EltSize == 32) {
unsigned MfvsrwzIndex = ST->isLittleEndian() ? 2 : 1;		unsigned MfvsrwzIndex = ST->isLittleEndian() ? 2 : 1;
if (Index == MfvsrwzIndex)		if (Index == MfvsrwzIndex)
return 1;		return 1;
}		}

// We need a vector extract (or mfvsrld). Assume vector operation cost.		// We need a vector extract (or mfvsrld). Assume vector operation cost.
// The cost of the load constant for a vector extract is disregarded		// The cost of the load constant for a vector extract is disregarded
// (invariant, easily schedulable).		// (invariant, easily schedulable).
return vectorCostAdjustment(1, Opcode, Val, nullptr);		return CostFactor;

} else if (ST->hasDirectMove())		} else if (ST->hasDirectMove())
// Assume permute has standard cost.		// Assume permute has standard cost.
// Assume move-to/move-from VSR have 2x standard cost.		// Assume move-to/move-from VSR have 2x standard cost.
return 3;		return 3;
}		}

// Estimated cost of a load-hit-store delay. This was obtained		// Estimated cost of a load-hit-store delay. This was obtained
[15 lines not shown]	InstructionCost PPCTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
return Cost;		return Cost;
}		}

InstructionCost PPCTTIImpl::getMemoryOpCost(unsigned Opcode, Type *Src,		InstructionCost PPCTTIImpl::getMemoryOpCost(unsigned Opcode, Type *Src,
MaybeAlign Alignment,		MaybeAlign Alignment,
unsigned AddressSpace,		unsigned AddressSpace,
TTI::TargetCostKind CostKind,		TTI::TargetCostKind CostKind,
const Instruction *I) {		const Instruction *I) {

		InstructionCost CostFactor = vectorCostAdjustmentFactor(Opcode, Src, nullptr);
		if (!CostFactor.isValid())
		return InstructionCost::getMax();

if (TLI->getValueType(DL, Src, true) == MVT::Other)		if (TLI->getValueType(DL, Src, true) == MVT::Other)
return BaseT::getMemoryOpCost(Opcode, Src, Alignment, AddressSpace,		return BaseT::getMemoryOpCost(Opcode, Src, Alignment, AddressSpace,
CostKind);		CostKind);
// Legalize the type.		// Legalize the type.
std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, Src);		std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, Src);
assert((Opcode == Instruction::Load || Opcode == Instruction::Store) &&		assert((Opcode == Instruction::Load || Opcode == Instruction::Store) &&
"Invalid Opcode");		"Invalid Opcode");

InstructionCost Cost =		InstructionCost Cost =
BaseT::getMemoryOpCost(Opcode, Src, Alignment, AddressSpace, CostKind);		BaseT::getMemoryOpCost(Opcode, Src, Alignment, AddressSpace, CostKind);
// TODO: Handle other cost kinds.		// TODO: Handle other cost kinds.
if (CostKind != TTI::TCK_RecipThroughput)		if (CostKind != TTI::TCK_RecipThroughput)
return Cost;		return Cost;

Cost = vectorCostAdjustment(Cost, Opcode, Src, nullptr);		Cost *= CostFactor;

bool IsAltivecType = ST->hasAltivec() &&		bool IsAltivecType = ST->hasAltivec() &&
(LT.second == MVT::v16i8 || LT.second == MVT::v8i16 ||		(LT.second == MVT::v16i8 || LT.second == MVT::v8i16 ||
LT.second == MVT::v4i32 || LT.second == MVT::v4f32);		LT.second == MVT::v4i32 || LT.second == MVT::v4f32);
bool IsVSXType = ST->hasVSX() &&		bool IsVSXType = ST->hasVSX() &&
(LT.second == MVT::v2f64 || LT.second == MVT::v2i64);		(LT.second == MVT::v2f64 || LT.second == MVT::v2i64);

// VSX has 32b/64b load instructions. Legalization can handle loading of		// VSX has 32b/64b load instructions. Legalization can handle loading of
[49 lines not shown]	InstructionCost PPCTTIImpl::getMemoryOpCost(unsigned Opcode, Type *Src,

return Cost;		return Cost;
}		}

InstructionCost PPCTTIImpl::getInterleavedMemoryOpCost(		InstructionCost PPCTTIImpl::getInterleavedMemoryOpCost(
unsigned Opcode, Type *VecTy, unsigned Factor, ArrayRef<unsigned> Indices,		unsigned Opcode, Type *VecTy, unsigned Factor, ArrayRef<unsigned> Indices,
Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,		Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,
bool UseMaskForCond, bool UseMaskForGaps) {		bool UseMaskForCond, bool UseMaskForGaps) {
		InstructionCost CostFactor =
		vectorCostAdjustmentFactor(Opcode, VecTy, nullptr);
		if (!CostFactor.isValid())
		return InstructionCost::getMax();

if (UseMaskForCond || UseMaskForGaps)		if (UseMaskForCond || UseMaskForGaps)
return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,		return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
Alignment, AddressSpace, CostKind,		Alignment, AddressSpace, CostKind,
UseMaskForCond, UseMaskForGaps);		UseMaskForCond, UseMaskForGaps);

assert(isa<VectorType>(VecTy) &&		assert(isa<VectorType>(VecTy) &&
"Expect a vector type for interleaved memory op");		"Expect a vector type for interleaved memory op");

[135 lines not shown]

llvm/test/Transforms/PGOProfile/ppc-prevent-mma-types.ll

This file was added.

				; RUN: opt --vec-extabi=true -passes='default<O3>' -mcpu=pwr10 \
				; RUN: -pgo-kind=pgo-instr-gen-pipeline -mtriple=powerpc-ibm-aix -S < %s | \
				; RUN: FileCheck %s
				; RUN: opt -passes='default<O3>' -mcpu=pwr10 -pgo-kind=pgo-instr-gen-pipeline \
				[Inline comment] fhahn: Is there a reason this test checks the full pipeline compared to just the SLP vectorizer pass? The changes look completely unrelated to PGO/profiling.
				[Inline comment] amyk: Thanks for taking a look at this. I originally found this test case under the full PGO pipeline. Then, I realized the SLP vectorizer was also an important culprit in reproducing the issue, but it appears to be a bit more complex than that. It seems like there are other passes that come into play when trying to reproduce this issue. I've attempted to find a minimal set of passes and I've posted a patch to hopefully be a bit more specific with the pass pipeline: https://reviews.llvm.org/D118142 I'd appreciate your opinion on the updated pass pipeline.
				; RUN: -mtriple=powerpc64le-unknown-linux-gnu -S < %s | FileCheck %s

				; When running this test case under opt + PGO, the SLPVectorizer previously had
				; an opportunity to produce wide vector types (such as <256 x i1>) within the
				; IR as it deemed these wide vector types to be cheap enough to produce.
				; Having this test ensures that the optimizer no longer generates wide vectors
				; within the IR.

				%0 = type <{ double }>
				%1 = type <{ [1 x %0]*, i8, i8, i8, i8, i32, i32, i32, [1 x i32], [1 x i32], [1 x i32], [24 x i8] }>
				declare i8* @__malloc()
				; CHECK-NOT: <256 x i1>
				; CHECK-NOT: <512 x i1>
				define dso_local void @test([0 x %0]* %arg, i32* %arg1, i32* %arg2, i32* %arg3, i32* %arg4) {
				%i = alloca [0 x %0]*, align 4
				store [0 x %0]* %arg, [0 x %0]** %i, align 4
				%i7 = alloca i32*, align 4
				store i32* %arg1, i32** %i7, align 4
				%i9 = alloca i32*, align 4
				store i32* %arg2, i32** %i9, align 4
				%i10 = alloca i32*, align 4
				store i32* %arg3, i32** %i10, align 4
				%i11 = alloca i32*, align 4
				store i32* %arg4, i32** %i11, align 4
				%i14 = alloca %1, align 4
				%i15 = alloca i32, align 4
				%i16 = alloca i32, align 4
				%i17 = alloca i32, align 4
				%i18 = alloca i32, align 4
				%i20 = alloca i32, align 4
				%i21 = alloca i32, align 4
				%i22 = alloca i32, align 4
				%i23 = alloca i32, align 4
				%i25 = alloca double, align 8
				%i26 = load i32*, i32** %i9, align 4
				%i27 = load i32, i32* %i26, align 4
				%i28 = select i1 false, i32 0, i32 %i27
				store i32 %i28, i32* %i15, align 4
				%i29 = load i32*, i32** %i7, align 4
				%i30 = load i32, i32* %i29, align 4
				%i31 = select i1 false, i32 0, i32 %i30
				store i32 %i31, i32* %i16, align 4
				%i32 = load i32, i32* %i15, align 4
				%i33 = mul i32 8, %i32
				store i32 %i33, i32* %i17, align 4
				%i34 = load i32, i32* %i17, align 4
				%i35 = load i32, i32* %i16, align 4
				%i36 = mul i32 %i34, %i35
				store i32 %i36, i32* %i18, align 4
				%i37 = load i32*, i32** %i9, align 4
				%i38 = load i32, i32* %i37, align 4
				%i39 = select i1 false, i32 0, i32 %i38
				store i32 %i39, i32* %i22, align 4
				%i40 = load i32*, i32** %i10, align 4
				%i41 = load i32, i32* %i40, align 4
				%i42 = select i1 false, i32 0, i32 %i41
				store i32 %i42, i32* %i23, align 4
				%i43 = getelementptr inbounds %1, %1* %i14, i32 0, i32 10
				%i44 = bitcast [1 x i32]* %i43 to i8*
				%i45 = getelementptr i8, i8* %i44, i32 -12
				%i46 = getelementptr inbounds i8, i8* %i45, i32 12
				%i47 = bitcast i8* %i46 to i32*
				%i48 = load i32, i32* %i23, align 4
				%i49 = select i1 false, i32 0, i32 %i48
				%i50 = load i32, i32* %i22, align 4
				%i51 = select i1 false, i32 0, i32 %i50
				%i52 = mul i32 %i51, 8
				%i53 = mul i32 %i49, %i52
				store i32 %i53, i32* %i47, align 4
				%i54 = getelementptr inbounds %1, %1* %i14, i32 0, i32 10
				%i55 = bitcast [1 x i32]* %i54 to i8*
				%i56 = getelementptr i8, i8* %i55, i32 -12
				%i57 = getelementptr inbounds i8, i8* %i56, i32 36
				%i58 = bitcast i8* %i57 to i32*
				store i32 8, i32* %i58, align 4
				%i60 = getelementptr inbounds %1, %1* %i14, i32 0, i32 0
				%i61 = call i8* @__malloc()
				%i62 = bitcast [1 x %0]** %i60 to i8**
				store i8* %i61, i8** %i62, align 4
				br label %bb63
				bb63: ; preds = %bb66, %bb
				%i64 = load i32*, i32** %i11, align 4
				%i65 = load i32, i32* %i64, align 4
				br label %bb66
				bb66: ; preds = %bb165, %bb63
				%i67 = load i32, i32* %i21, align 4
				%i68 = icmp sle i32 %i67, %i65
				br i1 %i68, label %bb69, label %bb63
				bb69: ; preds = %bb66
				store i32 1, i32* %i20, align 4
				br label %bb70
				bb70: ; preds = %bb163, %bb69
				%i71 = load i32, i32* %i20, align 4
				%i72 = icmp sle i32 %i71, 11
				br i1 %i72, label %bb73, label %bb165
				bb73: ; preds = %bb70
				%i74 = load i32, i32* %i21, align 4
				%i76 = mul i32 %i74, 8
				%i77 = getelementptr inbounds i8, i8* null, i32 %i76
				%i78 = bitcast i8* %i77 to double*
				%i79 = load double, double* %i78, align 8
				%i80 = fcmp fast olt double %i79, 0.000000e+00
				%i81 = zext i1 %i80 to i32
				%i82 = trunc i32 %i81 to i1
				br i1 %i82, label %bb83, label %bb102
				bb83: ; preds = %bb73
				%i84 = getelementptr inbounds %1, %1* %i14, i32 0, i32 0
				%i85 = load [1 x %0]*, [1 x %0]** %i84, align 4
				%i86 = bitcast [1 x %0]* %i85 to i8*
				%i87 = getelementptr i8, i8* %i86, i32 0
				%i88 = load i32, i32* %i20, align 4
				%i89 = getelementptr inbounds %1, %1* %i14, i32 0, i32 10
				%i90 = getelementptr inbounds [1 x i32], [1 x i32]* %i89, i32 0, i32 0
				%i91 = load i32, i32* %i90, align 4
				%i92 = mul i32 %i88, %i91
				%i93 = getelementptr inbounds i8, i8* %i87, i32 %i92
				%i94 = getelementptr inbounds i8, i8* %i93, i32 0
				%i95 = load i32, i32* %i21, align 4
				%i96 = getelementptr inbounds %1, %1* %i14, i32 0, i32 10
				%i97 = getelementptr inbounds [1 x i32], [1 x i32]* %i96, i32 0, i32 6
				%i98 = load i32, i32* %i97, align 4
				%i99 = mul i32 %i95, %i98
				%i100 = getelementptr inbounds i8, i8* %i94, i32 %i99
				%i101 = bitcast i8* %i100 to double*
				store double 0.000000e+00, double* %i101, align 8
				br label %bb163
				bb102: ; preds = %bb73
				%i103 = getelementptr i8, i8* null, i32 -8
				%i104 = getelementptr inbounds i8, i8* %i103, i32 undef
				%i105 = bitcast i8* %i104 to double*
				%i106 = load double, double* %i105, align 8
				%i107 = load [0 x %0]*, [0 x %0]** %i, align 4
				%i108 = bitcast [0 x %0]* %i107 to i8*
				%i109 = getelementptr i8, i8* %i108, i32 -8
				%i110 = getelementptr inbounds i8, i8* %i109, i32 undef
				%i111 = bitcast i8* %i110 to double*
				%i112 = load double, double* %i111, align 8
				%i113 = fmul fast double %i106, %i112
				%i114 = fcmp fast ogt double 0.000000e+00, %i113
				%i115 = zext i1 %i114 to i32
				%i116 = trunc i32 %i115 to i1
				br i1 %i116, label %bb117, label %bb136
				bb117: ; preds = %bb102
				%i118 = getelementptr inbounds %1, %1* %i14, i32 0, i32 0
				%i119 = load [1 x %0]*, [1 x %0]** %i118, align 4
				%i120 = bitcast [1 x %0]* %i119 to i8*
				%i121 = getelementptr i8, i8* %i120, i32 0
				%i122 = load i32, i32* %i20, align 4
				%i123 = getelementptr inbounds %1, %1* %i14, i32 0, i32 10
				%i124 = getelementptr inbounds [1 x i32], [1 x i32]* %i123, i32 0, i32 0
				%i125 = load i32, i32* %i124, align 4
				%i126 = mul i32 %i122, %i125
				%i127 = getelementptr inbounds i8, i8* %i121, i32 %i126
				%i128 = getelementptr inbounds i8, i8* %i127, i32 0
				%i129 = load i32, i32* %i21, align 4
				%i130 = getelementptr inbounds %1, %1* %i14, i32 0, i32 10
				%i131 = getelementptr inbounds [1 x i32], [1 x i32]* %i130, i32 0, i32 6
				%i132 = load i32, i32* %i131, align 4
				%i133 = mul i32 %i129, %i132
				%i134 = getelementptr inbounds i8, i8* %i128, i32 %i133
				%i135 = bitcast i8* %i134 to double*
				store double 0.000000e+00, double* %i135, align 8
				br label %bb163
				bb136: ; preds = %bb102
				%i137 = load double, double* null, align 8
				%i138 = load double, double* null, align 8
				%i139 = fmul fast double %i137, %i138
				%i140 = fsub fast double 0.000000e+00, %i139
				store double %i140, double* %i25, align 8
				%i141 = load i32, i32* %i21, align 4
				%i143 = getelementptr inbounds [1 x i32], [1 x i32]* null, i32 0, i32 6
				%i144 = load i32, i32* %i143, align 4
				%i145 = mul i32 %i141, %i144
				%i146 = getelementptr inbounds i8, i8* null, i32 %i145
				%i147 = bitcast i8* %i146 to double*
				%i148 = load i32, i32* %i20, align 4
				%i149 = load i32, i32* %i18, align 4
				%i151 = mul i32 %i148, %i149
				%i152 = getelementptr i8, i8* null, i32 %i151
				%i153 = getelementptr i8, i8* %i152, i32 0
				%i154 = getelementptr inbounds i8, i8* %i153, i32 0
				%i155 = bitcast i8* %i154 to double*
				%i156 = load double, double* %i155, align 8
				%i157 = load double, double* %i25, align 8
				%i158 = fmul fast double %i156, %i157
				%i159 = fadd fast double 0.000000e+00, %i158
				%i160 = load double, double* %i25, align 8
				%i161 = fadd fast double 0.000000e+00, %i160
				%i162 = fdiv fast double %i159, %i161
				store double %i162, double* %i147, align 8
				br label %bb163
				bb163: ; preds = %bb136, %bb117, %bb83
				%i164 = add nsw i32 %i71, 1
				store i32 %i164, i32* %i20, align 4
				br label %bb70
				bb165: ; preds = %bb70
				%i166 = add nsw i32 %i67, 1
				store i32 %i166, i32* %i21, align 4
				br label %bb66
				}