This is an archive of the discontinued LLVM Phabricator instance.


Differential D113900

[PowerPC] Prevent the optimizer from producing wide vector types in IR.
ClosedPublic

Authored by amyk on Nov 15 2021, 7:34 AM.

Details

Reviewers

nemanjai
lei

Group Reviewers

Restricted Project

Commits

rG150681f2f322: [PowerPC] Prevent the optimizer from producing wide vector types in IR.

Summary

This patch prevents the optimizer from producing wide vectors in the IR, specifically the
MMA types (v256i1, v512i1). The idea is that on Power, we want to produce these types
only if the vector_pair and vector_quad types are used explicitly.
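
As a rough illustration of which types are meant, the check can be modelled outside of LLVM. The sketch below mirrors the shape of the isMMAType() helper shown in the diff (i1 elements, wider than 128 bits); ToyVectorType, totalBits, isMMALike and checkPredicate are hypothetical names invented for this example and are not LLVM APIs.

```cpp
#include <cassert>

// Hypothetical stand-in for an LLVM fixed vector type (not the real llvm::Type).
struct ToyVectorType {
  unsigned ScalarBits; // width of one element in bits
  unsigned NumElts;    // number of elements
};

// Total width of the vector in bits.
constexpr unsigned totalBits(ToyVectorType Ty) {
  return Ty.ScalarBits * Ty.NumElts;
}

// Same shape as the patch's isMMAType(): i1 elements and wider than 128 bits.
constexpr bool isMMALike(ToyVectorType Ty) {
  return Ty.ScalarBits == 1 && totalBits(Ty) > 128;
}

// Runtime spot checks of the predicate.
inline void checkPredicate() {
  assert(isMMALike(ToyVectorType{1, 256}));  // <256 x i1>
  assert(isMMALike(ToyVectorType{1, 512}));  // <512 x i1>
  assert(!isMMALike(ToyVectorType{1, 128})); // <128 x i1>: not wider than 128 bits
  assert(!isMMALike(ToyVectorType{32, 8}));  // <8 x i32>: 256 bits, but not i1 elements
}
```

Note that a predicate of this shape would also match other i1 vectors wider than 128 bits, not only the two types named in the summary.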

To prevent the optimizer from producing these types in the IR, vectorCostAdjustmentFactor()
is updated to return either an instruction cost factor or, if the current type is an MMA type,
an invalid instruction cost. An invalid instruction cost returned by this function signals the
other cost-computing functions to return the maximum instruction cost, which tells the optimizer
that these types are expensive and should not be produced in the first place.
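
The mechanism described above can be sketched in a self-contained form. This is a simplified model under stated assumptions, not the code from the patch: ToyType, ToyFactor, ToyMaxCost, isWideI1Vector, toyCostAdjustmentFactor and toyArithmeticCost are hypothetical names, a plain Valid flag stands in for an invalid InstructionCost, and the multiplier values 1 and 2 are chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-ins; none of these are LLVM types.
struct ToyType {
  bool IsVector;
  unsigned ScalarBits;
  unsigned TotalBits;
};

struct ToyFactor {
  bool Valid;     // false models an invalid InstructionCost
  unsigned Value; // multiplier, meaningful only when Valid is true
};

constexpr std::uint64_t ToyMaxCost = std::numeric_limits<std::uint64_t>::max();

inline bool isWideI1Vector(const ToyType &Ty) {
  return Ty.IsVector && Ty.ScalarBits == 1 && Ty.TotalBits > 128;
}

// Returns an invalid factor for MMA-like types, otherwise a multiplier.
// The values 1 and 2 are assumptions chosen for illustration.
inline ToyFactor toyCostAdjustmentFactor(const ToyType &Ty,
                                         bool VectorsUseTwoUnits) {
  if (isWideI1Vector(Ty))
    return ToyFactor{false, 0};
  if (!VectorsUseTwoUnits || !Ty.IsVector)
    return ToyFactor{true, 1};
  return ToyFactor{true, 2};
}

// A cost function asks for the factor first and reports the maximum cost
// when the factor is invalid.
inline std::uint64_t toyArithmeticCost(const ToyType &Ty,
                                       std::uint64_t BaseCost,
                                       bool VectorsUseTwoUnits) {
  ToyFactor F = toyCostAdjustmentFactor(Ty, VectorsUseTwoUnits);
  if (!F.Valid)
    return ToyMaxCost;
  return BaseCost * F.Value;
}
```

In this model, a <256 x i1> operand always yields ToyMaxCost, while an ordinary 128-bit vector only has its base cost scaled.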

This issue was first seen in the test case included within this patch.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

amyk created this revision. Nov 15 2021, 7:34 AM

Herald added subscribers: wenlei, shchenz, kbarton, hiraditya. Nov 15 2021, 7:34 AM

amyk requested review of this revision. Nov 15 2021, 7:34 AM

Harbormaster completed remote builds in B134255: Diff 387007. Nov 15 2021, 8:15 AM

nemanjai added inline comments. Nov 15 2021, 9:24 AM

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
998	Can you please explain why we don't want to do this check in the above condition?
1083	If we are exiting early, might as well do it before we do the ISD/Cost computations above.

Instead of adding:

if (isMMAType(Src))
    return InstructionCost::getMax();

to all the different cost calculating functions, is it possible to just add it to the underlying functions?

For example, I see this check was added to multiple functions that also call vectorCostAdjustment() or getMemoryOpCost().
I haven't tried this, so I'm not sure it will work, but I was thinking it would make sense to at least add this to
vectorCostAdjustment(), since it deals with vector types.

Discussed refactoring this approach with Nemanja and Lei outside of this review.
The suggestion was to instead transform the vectorCostAdjustment() function so that it returns a cost adjustment
factor to multiply by, rather than the newly computed cost.

This enables us to place code to check for the MMA types within vectorCostAdjustment() and call the function
first in all of the instruction cost functions, while allowing us to exit early if we happen to find an MMA type.
Moreover, since many functions rely on vectorCostAdjustment() for cost computation anyway, it reduces the
chances of missing the opportunity to return the maximum instruction cost when an MMA type is found.
Essentially, we want to guarantee that every instruction cost function can return a maximum instruction cost for
the MMA types and this change can assist in this goal.
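
The benefit of centralizing the check can be illustrated with a small model in which several cost entry points all go through one shared helper. Everything here is hypothetical and only loosely follows the discussion: SketchType, sharedFactor, applyFactor, the three cost functions, and the use of 0 as a "reject this type" marker are assumptions made for the sketch, not the patch's actual design.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical stand-in for a vector type.
struct SketchType {
  unsigned ScalarBits;
  unsigned TotalBits;
};

constexpr std::uint64_t SketchMaxCost =
    std::numeric_limits<std::uint64_t>::max();

// Shared helper: 0 is an assumed "reject this type" marker, any other
// value is a multiplier for the base cost.
inline unsigned sharedFactor(SketchType Ty) {
  bool IsMMALike = Ty.ScalarBits == 1 && Ty.TotalBits > 128;
  return IsMMALike ? 0u : 1u;
}

// Every entry point routes through the shared helper before doing any work.
inline std::uint64_t applyFactor(SketchType Ty, std::uint64_t Base) {
  unsigned F = sharedFactor(Ty);
  return F == 0 ? SketchMaxCost : Base * F;
}

// Base costs 1, 2 and 4 are arbitrary illustrative values.
inline std::uint64_t arithCost(SketchType Ty) { return applyFactor(Ty, 1); }
inline std::uint64_t shuffleCost(SketchType Ty) { return applyFactor(Ty, 2); }
inline std::uint64_t memoryCost(SketchType Ty) { return applyFactor(Ty, 4); }

using CostFn = std::uint64_t (*)(SketchType);
static const CostFn AllCostFns[] = {arithCost, shuffleCost, memoryCost};

// True only if every registered cost function reports the maximum cost.
inline bool allReportMaxCost(SketchType Ty) {
  for (CostFn Fn : AllCostFns)
    if (Fn(Ty) != SketchMaxCost)
      return false;
  return true;
}
```

Because the type check lives in one place, a newly added cost function that uses applyFactor() inherits the behaviour without needing its own copy of the check.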

Harbormaster completed remote builds in B135268: Diff 388704. Nov 20 2021, 9:08 AM

LGTM aside from a name and return type change.

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
960	This should be returning `unsigned`. And the name should change to indicate that this is a factor.

This revision is now accepted and ready to land. Nov 24 2021, 10:10 AM

LGTM other than the comment about the unnecessary code block.

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
1029–1034	I don't think this is needed....

Update the patch to utilize an invalid InstructionCost, and return InstructionCost in vectorCostAdjustment. This was discussed with Nemanja and Lei outside of the Phabricator review.

LGTM. Thanks for the update.

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
963	This comment is now out of date. Please update it on the commit.

amyk edited the summary of this revision. Nov 25 2021, 9:14 AM

amyk added inline comments.

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp
963	Thanks Nemanja for the review. I'll update it on the commit.

Harbormaster completed remote builds in B136085: Diff 389815. Nov 25 2021, 9:34 AM

Closed by commit rG150681f2f322: [PowerPC] Prevent the optimizer from producing wide vector types in IR. (authored by amyk). Nov 25 2021, 10:36 AM

This revision was automatically updated to reflect the committed changes.

amyk added a commit: rG150681f2f322: [PowerPC] Prevent the optimizer from producing wide vector types in IR..

I happened to read the internal issue this patch is trying to resolve. My question may be outside the scope of this patch: is the LLVM back end supposed to generate proper code for any legal LLVM IR? I still see llc fail to compile the attached LLVM IR with -mcpu=pwr10, while it succeeds with -mcpu=pwr9.

Attachment: v256.ll (16 KB)

Given that new types will keep being added to MVT, is the back end for each target responsible for handling all newly added MVTs (setting their actions to custom, expand, promote, etc.)?

fhahn added a subscriber: fhahn. Jan 14 2022, 4:05 AM

fhahn added inline comments.

llvm/test/Transforms/PGOProfile/ppc-prevent-mma-types.ll
5	Is there a reason this test checks the full pipeline compared to just the SLP vectorizer pass? The changes look completely unrelated to PGO/profiling.

amyk mentioned this in D118142: [PowerPC] Update ppc-prevent-mma-types.ll with custom opt pipeline. Jan 25 2022, 6:45 AM

amyk added inline comments.

llvm/test/Transforms/PGOProfile/ppc-prevent-mma-types.ll
5	Thanks for taking a look at this. I originally found this test case under the full PGO pipeline. Then, I realized the SLP vectorizer was also an important culprit in reproducing the issue, but it appears to be a bit more complex than that. It seems like there are other passes that come into play when trying to reproduce this issue. I've attempted to find a minimal set of passes and I've posted a patch to hopefully be a bit more specific with the pass pipeline: https://reviews.llvm.org/D118142 I'd appreciate your opinion on the updated pass pipeline.

Revision Contents

Path	Size
llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp	39 lines
llvm/test/Transforms/PGOProfile/ppc-prevent-mma-types.ll	205 lines

Diff 387007

llvm/lib/Target/PowerPC/PPCTargetTransformInfo.cpp

(312 lines not shown)	if (Idx == ImmIdx && Imm.getBitWidth() <= 64) {

if (ShiftedFree && (Imm.getZExtValue() & 0xFFFF) == 0)		if (ShiftedFree && (Imm.getZExtValue() & 0xFFFF) == 0)
return TTI::TCC_Free;		return TTI::TCC_Free;
}		}

return PPCTTIImpl::getIntImmCost(Imm, Ty, CostKind);		return PPCTTIImpl::getIntImmCost(Imm, Ty, CostKind);
}		}

		// Check if the current Type is an MMA vector type. Valid MMA types are
		// v256i1 and v512i1 respectively.
		static bool isMMAType(Type *Ty) {
		return Ty->isVectorTy() && (Ty->getScalarSizeInBits() == 1) &&
		(Ty->getPrimitiveSizeInBits() > 128);
		}

InstructionCost PPCTTIImpl::getUserCost(const User *U,		InstructionCost PPCTTIImpl::getUserCost(const User *U,
ArrayRef<const Value *> Operands,		ArrayRef<const Value *> Operands,
TTI::TargetCostKind CostKind) {		TTI::TargetCostKind CostKind) {
// We already implement getCastInstrCost and getMemoryOpCost where we perform		// We already implement getCastInstrCost and getMemoryOpCost where we perform
// the vector adjustment there.		// the vector adjustment there.
if (isa<CastInst>(U) || isa<LoadInst>(U) || isa<StoreInst>(U))		if (isa<CastInst>(U) || isa<LoadInst>(U) || isa<StoreInst>(U))
return BaseT::getUserCost(U, Operands, CostKind);		return BaseT::getUserCost(U, Operands, CostKind);

if (U->getType()->isVectorTy()) {		if (U->getType()->isVectorTy()) {
		// Set the max cost if an MMA type is present (v256i1, v512i1).
		if (isMMAType(U->getType()))
		return InstructionCost::getMax();

// Instructions that need to be split should cost more.		// Instructions that need to be split should cost more.
std::pair<InstructionCost, MVT> LT =		std::pair<InstructionCost, MVT> LT =
TLI->getTypeLegalizationCost(DL, U->getType());		TLI->getTypeLegalizationCost(DL, U->getType());
return LT.first * BaseT::getUserCost(U, Operands, CostKind);		return LT.first * BaseT::getUserCost(U, Operands, CostKind);
}		}

return BaseT::getUserCost(U, Operands, CostKind);		return BaseT::getUserCost(U, Operands, CostKind);
}		}
(603 lines not shown)	unsigned PPCTTIImpl::getMaxInterleaveFactor(unsigned VF) {
// out-of-order execution).		// out-of-order execution).
return 2;		return 2;
}		}

// Adjust the cost of vector instructions on targets which there is overlap		// Adjust the cost of vector instructions on targets which there is overlap
// between the vector and scalar units, thereby reducing the overall throughput		// between the vector and scalar units, thereby reducing the overall throughput
// of vector code wrt. scalar code.		// of vector code wrt. scalar code.
InstructionCost PPCTTIImpl::vectorCostAdjustment(InstructionCost Cost,		InstructionCost PPCTTIImpl::vectorCostAdjustment(InstructionCost Cost,
unsigned Opcode, Type *Ty1,		unsigned Opcode, Type *Ty1,
Type *Ty2) {		Type *Ty2) {
if (!ST->vectorsUseTwoUnits() || !Ty1->isVectorTy())		if (!ST->vectorsUseTwoUnits() || !Ty1->isVectorTy())
return Cost;		return Cost;

std::pair<InstructionCost, MVT> LT1 = TLI->getTypeLegalizationCost(DL, Ty1);		std::pair<InstructionCost, MVT> LT1 = TLI->getTypeLegalizationCost(DL, Ty1);
// If type legalization involves splitting the vector, we don't want to		// If type legalization involves splitting the vector, we don't want to
// double the cost at every step - only the last step.		// double the cost at every step - only the last step.
if (LT1.first != 1 || !LT1.second.isVector())		if (LT1.first != 1 || !LT1.second.isVector())
return Cost;		return Cost;

int ISD = TLI->InstructionOpcodeToISD(Opcode);		int ISD = TLI->InstructionOpcodeToISD(Opcode);
(17 lines not shown)	InstructionCost PPCTTIImpl::getArithmeticInstrCost(
const Instruction *CxtI) {		const Instruction *CxtI) {
assert(TLI->InstructionOpcodeToISD(Opcode) && "Invalid opcode");		assert(TLI->InstructionOpcodeToISD(Opcode) && "Invalid opcode");
// TODO: Handle more cost kinds.		// TODO: Handle more cost kinds.
if (CostKind != TTI::TCK_RecipThroughput)		if (CostKind != TTI::TCK_RecipThroughput)
return BaseT::getArithmeticInstrCost(Opcode, Ty, CostKind, Op1Info,		return BaseT::getArithmeticInstrCost(Opcode, Ty, CostKind, Op1Info,
Op2Info, Opd1PropInfo,		Op2Info, Opd1PropInfo,
Opd2PropInfo, Args, CxtI);		Opd2PropInfo, Args, CxtI);

		// Set the max cost if an MMA type is present (v256i1, v512i1).
		if (isMMAType(Ty))
		return InstructionCost::getMax();

// Fallback to the default implementation.		// Fallback to the default implementation.
InstructionCost Cost = BaseT::getArithmeticInstrCost(		InstructionCost Cost = BaseT::getArithmeticInstrCost(
Opcode, Ty, CostKind, Op1Info, Op2Info, Opd1PropInfo, Opd2PropInfo);		Opcode, Ty, CostKind, Op1Info, Op2Info, Opd1PropInfo, Opd2PropInfo);
return vectorCostAdjustment(Cost, Opcode, Ty, nullptr);		return vectorCostAdjustment(Cost, Opcode, Ty, nullptr);
}		}

InstructionCost PPCTTIImpl::getShuffleCost(TTI::ShuffleKind Kind, Type *Tp,		InstructionCost PPCTTIImpl::getShuffleCost(TTI::ShuffleKind Kind, Type *Tp,
ArrayRef<int> Mask, int Index,		ArrayRef<int> Mask, int Index,
Type *SubTp) {		Type *SubTp) {
// Legalize the type.		// Legalize the type.
std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, Tp);		std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, Tp);

		// Set the max cost if an MMA type is present (v256i1, v512i1).
		if (isMMAType(Tp))
		return InstructionCost::getMax();

// PPC, for both Altivec/VSX, support cheap arbitrary permutations		// PPC, for both Altivec/VSX, support cheap arbitrary permutations
// (at least in the sense that there need only be one non-loop-invariant		// (at least in the sense that there need only be one non-loop-invariant
// instruction). We need one such shuffle instruction for each actual		// instruction). We need one such shuffle instruction for each actual
// register (this is not true for arbitrary shuffles, but is true for the		// register (this is not true for arbitrary shuffles, but is true for the
// structured types of shuffles covered by TTI::ShuffleKind).		// structured types of shuffles covered by TTI::ShuffleKind).
return vectorCostAdjustment(LT.first, Instruction::ShuffleVector, Tp,		return vectorCostAdjustment(LT.first, Instruction::ShuffleVector, Tp,
nullptr);		nullptr);
}		}

InstructionCost PPCTTIImpl::getCFInstrCost(unsigned Opcode,		InstructionCost PPCTTIImpl::getCFInstrCost(unsigned Opcode,
TTI::TargetCostKind CostKind,		TTI::TargetCostKind CostKind,
const Instruction *I) {		const Instruction *I) {
if (CostKind != TTI::TCK_RecipThroughput)		if (CostKind != TTI::TCK_RecipThroughput)
return Opcode == Instruction::PHI ? 0 : 1;		return Opcode == Instruction::PHI ? 0 : 1;
// Branches are assumed to be predicted.		// Branches are assumed to be predicted.
return 0;		return 0;
}		}

InstructionCost PPCTTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,		InstructionCost PPCTTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst,
Type *Src,		Type *Src,
TTI::CastContextHint CCH,		TTI::CastContextHint CCH,
TTI::TargetCostKind CostKind,		TTI::TargetCostKind CostKind,
const Instruction *I) {		const Instruction *I) {
assert(TLI->InstructionOpcodeToISD(Opcode) && "Invalid opcode");		assert(TLI->InstructionOpcodeToISD(Opcode) && "Invalid opcode");

		// Set the max cost if an MMA type is present (v256i1, v512i1).
		if (isMMAType(Dst))
		return InstructionCost::getMax();

InstructionCost Cost =		InstructionCost Cost =
BaseT::getCastInstrCost(Opcode, Dst, Src, CCH, CostKind, I);		BaseT::getCastInstrCost(Opcode, Dst, Src, CCH, CostKind, I);
Cost = vectorCostAdjustment(Cost, Opcode, Dst, Src);		Cost = vectorCostAdjustment(Cost, Opcode, Dst, Src);
// TODO: Allow non-throughput costs that aren't binary.		// TODO: Allow non-throughput costs that aren't binary.
if (CostKind != TTI::TCK_RecipThroughput)		if (CostKind != TTI::TCK_RecipThroughput)
return Cost == 0 ? 0 : 1;		return Cost == 0 ? 0 : 1;
return Cost;		return Cost;
}		}

InstructionCost PPCTTIImpl::getCmpSelInstrCost(unsigned Opcode, Type *ValTy,		InstructionCost PPCTTIImpl::getCmpSelInstrCost(unsigned Opcode, Type *ValTy,
Type *CondTy,		Type *CondTy,
CmpInst::Predicate VecPred,		CmpInst::Predicate VecPred,
TTI::TargetCostKind CostKind,		TTI::TargetCostKind CostKind,
const Instruction *I) {		const Instruction *I) {
		// Set the max cost if an MMA type is present (v256i1, v512i1).
		if (isMMAType(ValTy))
		return InstructionCost::getMax();

InstructionCost Cost =		InstructionCost Cost =
BaseT::getCmpSelInstrCost(Opcode, ValTy, CondTy, VecPred, CostKind, I);		BaseT::getCmpSelInstrCost(Opcode, ValTy, CondTy, VecPred, CostKind, I);
// TODO: Handle other cost kinds.		// TODO: Handle other cost kinds.
if (CostKind != TTI::TCK_RecipThroughput)		if (CostKind != TTI::TCK_RecipThroughput)
return Cost;		return Cost;
return vectorCostAdjustment(Cost, Opcode, ValTy, nullptr);		return vectorCostAdjustment(Cost, Opcode, ValTy, nullptr);
}		}

InstructionCost PPCTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,		InstructionCost PPCTTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
unsigned Index) {		unsigned Index) {
assert(Val->isVectorTy() && "This must be a vector type");		assert(Val->isVectorTy() && "This must be a vector type");

int ISD = TLI->InstructionOpcodeToISD(Opcode);		int ISD = TLI->InstructionOpcodeToISD(Opcode);
assert(ISD && "Invalid opcode");		assert(ISD && "Invalid opcode");

InstructionCost Cost = BaseT::getVectorInstrCost(Opcode, Val, Index);		InstructionCost Cost = BaseT::getVectorInstrCost(Opcode, Val, Index);
Cost = vectorCostAdjustment(Cost, Opcode, Val, nullptr);		Cost = vectorCostAdjustment(Cost, Opcode, Val, nullptr);

		// Set the max cost if an MMA type is present (v256i1, v512i1).
		if (isMMAType(Val))
		return InstructionCost::getMax();

if (ST->hasVSX() && Val->getScalarType()->isDoubleTy()) {		if (ST->hasVSX() && Val->getScalarType()->isDoubleTy()) {
// Double-precision scalars are already located in index #0 (or #1 if LE).		// Double-precision scalars are already located in index #0 (or #1 if LE).
if (ISD == ISD::EXTRACT_VECTOR_ELT &&		if (ISD == ISD::EXTRACT_VECTOR_ELT &&
Index == (ST->isLittleEndian() ? 1 : 0))		Index == (ST->isLittleEndian() ? 1 : 0))
return 0;		return 0;

return Cost;		return Cost;

(54 lines not shown)	InstructionCost PPCTTIImpl::getMemoryOpCost(unsigned Opcode, Type *Src,
if (TLI->getValueType(DL, Src, true) == MVT::Other)		if (TLI->getValueType(DL, Src, true) == MVT::Other)
return BaseT::getMemoryOpCost(Opcode, Src, Alignment, AddressSpace,		return BaseT::getMemoryOpCost(Opcode, Src, Alignment, AddressSpace,
CostKind);		CostKind);
// Legalize the type.		// Legalize the type.
std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, Src);		std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, Src);
assert((Opcode == Instruction::Load || Opcode == Instruction::Store) &&		assert((Opcode == Instruction::Load || Opcode == Instruction::Store) &&
"Invalid Opcode");		"Invalid Opcode");

		// Set the max cost if an MMA type is present (v256i1, v512i1).
		if (isMMAType(Src))
		return InstructionCost::getMax();

InstructionCost Cost =		InstructionCost Cost =
BaseT::getMemoryOpCost(Opcode, Src, Alignment, AddressSpace, CostKind);		BaseT::getMemoryOpCost(Opcode, Src, Alignment, AddressSpace, CostKind);
// TODO: Handle other cost kinds.		// TODO: Handle other cost kinds.
if (CostKind != TTI::TCK_RecipThroughput)		if (CostKind != TTI::TCK_RecipThroughput)
return Cost;		return Cost;

Cost = vectorCostAdjustment(Cost, Opcode, Src, nullptr);		Cost = vectorCostAdjustment(Cost, Opcode, Src, nullptr);

(64 lines not shown)	InstructionCost PPCTTIImpl::getInterleavedMemoryOpCost(
if (UseMaskForCond || UseMaskForGaps)		if (UseMaskForCond || UseMaskForGaps)
return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,		return BaseT::getInterleavedMemoryOpCost(Opcode, VecTy, Factor, Indices,
Alignment, AddressSpace, CostKind,		Alignment, AddressSpace, CostKind,
UseMaskForCond, UseMaskForGaps);		UseMaskForCond, UseMaskForGaps);

assert(isa<VectorType>(VecTy) &&		assert(isa<VectorType>(VecTy) &&
"Expect a vector type for interleaved memory op");		"Expect a vector type for interleaved memory op");

		// Set the max cost if an MMA type is present (v256i1, v512i1).
		if (isMMAType(VecTy))
		return InstructionCost::getMax();

// Legalize the type.		// Legalize the type.
std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, VecTy);		std::pair<InstructionCost, MVT> LT = TLI->getTypeLegalizationCost(DL, VecTy);

// Firstly, the cost of load/store operation.		// Firstly, the cost of load/store operation.
InstructionCost Cost = getMemoryOpCost(Opcode, VecTy, MaybeAlign(Alignment),		InstructionCost Cost = getMemoryOpCost(Opcode, VecTy, MaybeAlign(Alignment),
AddressSpace, CostKind);		AddressSpace, CostKind);

// PPC, for both Altivec/VSX, support cheap arbitrary permutations		// PPC, for both Altivec/VSX, support cheap arbitrary permutations
(127 lines not shown)

llvm/test/Transforms/PGOProfile/ppc-prevent-mma-types.ll

This file was added.

				; RUN: opt --vec-extabi=true -passes='default<O3>' -mcpu=pwr10 \
				; RUN: -pgo-kind=pgo-instr-gen-pipeline -mtriple=powerpc-ibm-aix -S < %s | \
				; RUN: FileCheck %s
				; RUN: opt --vec-extabi=true -passes='default<O3>' -mcpu=pwr10 \
				; RUN: -pgo-kind=pgo-instr-gen-pipeline -mtriple=powerpc64le-unknown-linux-gnu \
				; RUN: -S < %s | FileCheck %s

				; When running this test case under opt + PGO, the SLPVectorizer previously had
				; an opportunity to produce wide vector types (such as <256 x i1>) within the
				; IR as it deemed these wide vector types to be cheap enough to produce.
				; Having this test ensures that the optimizer no longer generates wide vectors
				; within the IR.

				%0 = type <{ double }>
				%1 = type <{ [1 x %0]*, i8, i8, i8, i8, i32, i32, i32, [1 x i32], [1 x i32], [1 x i32], [24 x i8] }>
				declare i8* @__malloc()
				; CHECK-NOT: <256 x i1>
				; CHECK-NOT: <512 x i1>
				define dso_local void @test([0 x %0]* %arg, i32* %arg1, i32* %arg2, i32* %arg3, i32* %arg4) {
				%i = alloca [0 x %0]*, align 4
				store [0 x %0]* %arg, [0 x %0]** %i, align 4
				%i7 = alloca i32*, align 4
				store i32* %arg1, i32** %i7, align 4
				%i9 = alloca i32*, align 4
				store i32* %arg2, i32** %i9, align 4
				%i10 = alloca i32*, align 4
				store i32* %arg3, i32** %i10, align 4
				%i11 = alloca i32*, align 4
				store i32* %arg4, i32** %i11, align 4
				%i14 = alloca %1, align 4
				%i15 = alloca i32, align 4
				%i16 = alloca i32, align 4
				%i17 = alloca i32, align 4
				%i18 = alloca i32, align 4
				%i20 = alloca i32, align 4
				%i21 = alloca i32, align 4
				%i22 = alloca i32, align 4
				%i23 = alloca i32, align 4
				%i25 = alloca double, align 8
				%i26 = load i32*, i32** %i9, align 4
				%i27 = load i32, i32* %i26, align 4
				%i28 = select i1 false, i32 0, i32 %i27
				store i32 %i28, i32* %i15, align 4
				%i29 = load i32*, i32** %i7, align 4
				%i30 = load i32, i32* %i29, align 4
				%i31 = select i1 false, i32 0, i32 %i30
				store i32 %i31, i32* %i16, align 4
				%i32 = load i32, i32* %i15, align 4
				%i33 = mul i32 8, %i32
				store i32 %i33, i32* %i17, align 4
				%i34 = load i32, i32* %i17, align 4
				%i35 = load i32, i32* %i16, align 4
				%i36 = mul i32 %i34, %i35
				store i32 %i36, i32* %i18, align 4
				%i37 = load i32*, i32** %i9, align 4
				%i38 = load i32, i32* %i37, align 4
				%i39 = select i1 false, i32 0, i32 %i38
				store i32 %i39, i32* %i22, align 4
				%i40 = load i32*, i32** %i10, align 4
				%i41 = load i32, i32* %i40, align 4
				%i42 = select i1 false, i32 0, i32 %i41
				store i32 %i42, i32* %i23, align 4
				%i43 = getelementptr inbounds %1, %1* %i14, i32 0, i32 10
				%i44 = bitcast [1 x i32]* %i43 to i8*
				%i45 = getelementptr i8, i8* %i44, i32 -12
				%i46 = getelementptr inbounds i8, i8* %i45, i32 12
				%i47 = bitcast i8* %i46 to i32*
				%i48 = load i32, i32* %i23, align 4
				%i49 = select i1 false, i32 0, i32 %i48
				%i50 = load i32, i32* %i22, align 4
				%i51 = select i1 false, i32 0, i32 %i50
				%i52 = mul i32 %i51, 8
				%i53 = mul i32 %i49, %i52
				store i32 %i53, i32* %i47, align 4
				%i54 = getelementptr inbounds %1, %1* %i14, i32 0, i32 10
				%i55 = bitcast [1 x i32]* %i54 to i8*
				%i56 = getelementptr i8, i8* %i55, i32 -12
				%i57 = getelementptr inbounds i8, i8* %i56, i32 36
				%i58 = bitcast i8* %i57 to i32*
				store i32 8, i32* %i58, align 4
				%i60 = getelementptr inbounds %1, %1* %i14, i32 0, i32 0
				%i61 = call i8* @__malloc()
				%i62 = bitcast [1 x %0]** %i60 to i8**
				store i8* %i61, i8** %i62, align 4
				br label %bb63
				bb63: ; preds = %bb66, %bb
				%i64 = load i32*, i32** %i11, align 4
				%i65 = load i32, i32* %i64, align 4
				br label %bb66
				bb66: ; preds = %bb165, %bb63
				%i67 = load i32, i32* %i21, align 4
				%i68 = icmp sle i32 %i67, %i65
				br i1 %i68, label %bb69, label %bb63
				bb69: ; preds = %bb66
				store i32 1, i32* %i20, align 4
				br label %bb70
				bb70: ; preds = %bb163, %bb69
				%i71 = load i32, i32* %i20, align 4
				%i72 = icmp sle i32 %i71, 11
				br i1 %i72, label %bb73, label %bb165
				bb73: ; preds = %bb70
				%i74 = load i32, i32* %i21, align 4
				%i76 = mul i32 %i74, 8
				%i77 = getelementptr inbounds i8, i8* null, i32 %i76
				%i78 = bitcast i8* %i77 to double*
				%i79 = load double, double* %i78, align 8
				%i80 = fcmp fast olt double %i79, 0.000000e+00
				%i81 = zext i1 %i80 to i32
				%i82 = trunc i32 %i81 to i1
				br i1 %i82, label %bb83, label %bb102
				bb83: ; preds = %bb73
				%i84 = getelementptr inbounds %1, %1* %i14, i32 0, i32 0
				%i85 = load [1 x %0]*, [1 x %0]** %i84, align 4
				%i86 = bitcast [1 x %0]* %i85 to i8*
				%i87 = getelementptr i8, i8* %i86, i32 0
				%i88 = load i32, i32* %i20, align 4
				%i89 = getelementptr inbounds %1, %1* %i14, i32 0, i32 10
				%i90 = getelementptr inbounds [1 x i32], [1 x i32]* %i89, i32 0, i32 0
				%i91 = load i32, i32* %i90, align 4
				%i92 = mul i32 %i88, %i91
				%i93 = getelementptr inbounds i8, i8* %i87, i32 %i92
				%i94 = getelementptr inbounds i8, i8* %i93, i32 0
				%i95 = load i32, i32* %i21, align 4
				%i96 = getelementptr inbounds %1, %1* %i14, i32 0, i32 10
				%i97 = getelementptr inbounds [1 x i32], [1 x i32]* %i96, i32 0, i32 6
				%i98 = load i32, i32* %i97, align 4
				%i99 = mul i32 %i95, %i98
				%i100 = getelementptr inbounds i8, i8* %i94, i32 %i99
				%i101 = bitcast i8* %i100 to double*
				store double 0.000000e+00, double* %i101, align 8
				br label %bb163
				bb102: ; preds = %bb73
				%i103 = getelementptr i8, i8* null, i32 -8
				%i104 = getelementptr inbounds i8, i8* %i103, i32 undef
				%i105 = bitcast i8* %i104 to double*
				%i106 = load double, double* %i105, align 8
				%i107 = load [0 x %0]*, [0 x %0]** %i, align 4
				%i108 = bitcast [0 x %0]* %i107 to i8*
				%i109 = getelementptr i8, i8* %i108, i32 -8
				%i110 = getelementptr inbounds i8, i8* %i109, i32 undef
				%i111 = bitcast i8* %i110 to double*
				%i112 = load double, double* %i111, align 8
				%i113 = fmul fast double %i106, %i112
				%i114 = fcmp fast ogt double 0.000000e+00, %i113
				%i115 = zext i1 %i114 to i32
				%i116 = trunc i32 %i115 to i1
				br i1 %i116, label %bb117, label %bb136
				bb117: ; preds = %bb102
				%i118 = getelementptr inbounds %1, %1* %i14, i32 0, i32 0
				%i119 = load [1 x %0]*, [1 x %0]** %i118, align 4
				%i120 = bitcast [1 x %0]* %i119 to i8*
				%i121 = getelementptr i8, i8* %i120, i32 0
				%i122 = load i32, i32* %i20, align 4
				%i123 = getelementptr inbounds %1, %1* %i14, i32 0, i32 10
				%i124 = getelementptr inbounds [1 x i32], [1 x i32]* %i123, i32 0, i32 0
				%i125 = load i32, i32* %i124, align 4
				%i126 = mul i32 %i122, %i125
				%i127 = getelementptr inbounds i8, i8* %i121, i32 %i126
				%i128 = getelementptr inbounds i8, i8* %i127, i32 0
				%i129 = load i32, i32* %i21, align 4
				%i130 = getelementptr inbounds %1, %1* %i14, i32 0, i32 10
				%i131 = getelementptr inbounds [1 x i32], [1 x i32]* %i130, i32 0, i32 6
				%i132 = load i32, i32* %i131, align 4
				%i133 = mul i32 %i129, %i132
				%i134 = getelementptr inbounds i8, i8* %i128, i32 %i133
				%i135 = bitcast i8* %i134 to double*
				store double 0.000000e+00, double* %i135, align 8
				br label %bb163
				bb136: ; preds = %bb102
				%i137 = load double, double* null, align 8
				%i138 = load double, double* null, align 8
				%i139 = fmul fast double %i137, %i138
				%i140 = fsub fast double 0.000000e+00, %i139
				store double %i140, double* %i25, align 8
				%i141 = load i32, i32* %i21, align 4
				%i143 = getelementptr inbounds [1 x i32], [1 x i32]* null, i32 0, i32 6
				%i144 = load i32, i32* %i143, align 4
				%i145 = mul i32 %i141, %i144
				%i146 = getelementptr inbounds i8, i8* null, i32 %i145
				%i147 = bitcast i8* %i146 to double*
				%i148 = load i32, i32* %i20, align 4
				%i149 = load i32, i32* %i18, align 4
				%i151 = mul i32 %i148, %i149
				%i152 = getelementptr i8, i8* null, i32 %i151
				%i153 = getelementptr i8, i8* %i152, i32 0
				%i154 = getelementptr inbounds i8, i8* %i153, i32 0
				%i155 = bitcast i8* %i154 to double*
				%i156 = load double, double* %i155, align 8
				%i157 = load double, double* %i25, align 8
				%i158 = fmul fast double %i156, %i157
				%i159 = fadd fast double 0.000000e+00, %i158
				%i160 = load double, double* %i25, align 8
				%i161 = fadd fast double 0.000000e+00, %i160
				%i162 = fdiv fast double %i159, %i161
				store double %i162, double* %i147, align 8
				br label %bb163
				bb163: ; preds = %bb136, %bb117, %bb83
				%i164 = add nsw i32 %i71, 1
				store i32 %i164, i32* %i20, align 4
				br label %bb70
				bb165: ; preds = %bb70
				%i166 = add nsw i32 %i67, 1
				store i32 %i166, i32* %i21, align 4
				br label %bb66
				}