This is an archive of the discontinued LLVM Phabricator instance.

Differential D108826

[SLP][LTO][WIP]Allow full SLP in LTO only at link time.
Needs Review · Public

Authored by ABataev on Aug 27 2021, 9:04 AM.

This revision needs review, but there are no reviewers specified.

Details

Reviewers: None

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ABataev created this revision. · Aug 27 2021, 9:04 AM

Herald added subscribers: hiraditya, inglorion. · Aug 27 2021, 9:04 AM

ABataev requested review of this revision. · Aug 27 2021, 9:04 AM

Herald added projects: Restricted Project, Restricted Project. · Aug 27 2021, 9:04 AM

Herald added subscribers: llvm-commits, cfe-commits.

ABataev mentioned this in D103925: [X86][SSE] Support 64-bit vectorization (WIP). · Aug 27 2021, 9:05 AM

Harbormaster completed remote builds in B121506: Diff 369117. · Aug 27 2021, 9:43 AM

I think there is something really wrong with vectorizer passes in LTO pipelines.
Can you say whether the problem you are observing is in ThinLTO, Full LTO, or both?

In D108826#2969471, @lebedev.ri wrote:

I think there is something really wrong with vectorizer passes in LTO pipelines.
Can you say whether the problem you are observing is in ThinLTO, Full LTO, or both?

I saw it in Full LTO, but I suppose we have a similar problem in ThinLTO. At compile time, the SLP vectorizer tries to vectorize using small vectors, and this may affect other optimizations at link time (e.g. after inlining we may try to vectorize using larger vector sizes). This is just a preliminary attempt to see how we can fix this early optimization in SLP.
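To make the concern concrete, here is a minimal sketch of the kind of code where it could arise. The function names `storePair` and `storeQuad` are hypothetical and do not come from the patch, and whether a given compiler actually vectorizes either function depends on the target and its cost model.

```cpp
// Hypothetical illustration only; these functions are not part of the patch.
// Imagine storePair living in a different translation unit than storeQuad.

// Seen in isolation at compile time, the two adjacent stores may be
// SLP-vectorized as a single 2 x double store.
void storePair(double *Out, double A, double B) {
  Out[0] = A + 1.0;
  Out[1] = B + 1.0;
}

// Once both calls are inlined at link time, the four stores to Out[0..3]
// are adjacent and could form one 4 x double store. If storePair was already
// rewritten with 2-wide vectors before linking, the later SLP run sees
// vector code instead of four scalar stores.
void storeQuad(double *Out, const double *In) {
  storePair(Out, In[0], In[1]);
  storePair(Out + 2, In[2], In[3]);
}
```

The sketch only shows the shape of the problem; it does not claim that any particular LLVM version produces worse code for it.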

In D108826#2969547, @ABataev wrote:

I saw it in Full LTO, but I suppose we have a similar problem in ThinLTO. At compile time, the SLP vectorizer tries to vectorize using small vectors, and this may affect other optimizations at link time (e.g. after inlining we may try to vectorize using larger vector sizes). This is just a preliminary attempt to see how we can fix this early optimization in SLP.

Aha, so Full LTO. That is consistent with the phase-ordering dilemma @spatel discovered: D102002
IMO working around it in the pass isn't the right course of action. Such workarounds tend to stick around.

In D108826#2969594, @lebedev.ri wrote:

Aha, so Full LTO. That is consistent with the phase-ordering dilemma @spatel discovered: D102002

Aha, do I understand correctly that he is trying to add flag(s) to distinguish compiling without LTO, compiling with LTO, and linking with LTO? Or something else? Or is he just trying to reorder passes depending on whether we're in LTO or not?

IMO working around it in the pass isn't the right course of action. Such workarounds tend to stick around.

I agree. I thought about a pipeline fix; this is just a temporary solution to check how it affects performance. It becomes especially important for the upcoming non-power-of-2 vectorization, which may cause regressions with LTO.

In D108826#2969604, @ABataev wrote:

Aha, do I understand correctly that he is trying to add flag(s) to distinguish compiling without LTO, compiling with LTO, and linking with LTO? Or something else? Or is he just trying to reorder passes depending on whether we're in LTO or not?

We found that there were differences between regular and LTO for the passes invoked, their orderings, and parameters used to enable extra optimizations. (There was also inconsistency between new and old pass manager, but we can probably just focus on NPM now.)
I suspect that almost none of those differences were intentional - people just made changes for whatever pipeline they were interested in at the time and didn't realize there was divergence.
So we now have things refactored with this note:

/// TODO: Should LTO cause any differences to this set of passes?
void PassBuilder::addVectorPasses(OptimizationLevel Level,
                                  FunctionPassManager &FPM, bool IsFullLTO) {

So if there really is a reason for something to be different with LTO, it's set up to make that easily visible at least. :)
I made a couple of small fixes in there already, but basically any place where we do something differently for FullLTO should be investigated.

In D108826#2969677, @spatel wrote:
We found that there were differences between regular and LTO for the passes invoked, their orderings, and parameters used to enable extra optimizations. (There was also inconsistency between new and old pass manager, but we can probably just focus on NPM now.)
I suspect that almost none of those differences were intentional - people just made changes for whatever pipeline they were interested in at the time and didn't realize there was divergence.
So we now have things refactored with this note:
/// TODO: Should LTO cause any differences to this set of passes?
void PassBuilder::addVectorPasses(OptimizationLevel Level,
                                  FunctionPassManager &FPM, bool IsFullLTO) {
So if there really is a reason for something to be different with LTO, it's set up to make that easily visible at least. :)
I made a couple of small fixes in there already, but basically any place where we do something differently for FullLTO should be investigated.

Do I understand correctly that your patch just reorders passes? Because I need something a bit different. We need to run SLP only for the widest possible VF at compile time and run full SLP (for all possible VFs) only at link time.
Early optimization using a small VF may affect other passes (alias analysis, loads/stores/allocas elimination, Loop Vectorization, SLP itself, etc.), so we need to run full SLP only at link time.
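The policy described above can be sketched as a toy enumeration of the vectorization factors (VFs) a search would try. This is only an illustration under stated assumptions: `candidateVFs` and all of its parameter names are hypothetical, sizes are in bits, and the halving search loosely mirrors the loop in `tryToVectorizeList` rather than reproducing it.

```cpp
#include <vector>

// Toy model, not LLVM code: returns the VFs (element counts) a halving
// search would try, from widest to narrowest.
// With LimitToRegSize set (the compile-time step), only the widest VF that
// fills the maximal vector register is tried; otherwise (the link-time step)
// every power-of-two step down to the minimum VF is tried.
std::vector<unsigned> candidateVFs(unsigned MaxVecRegBits,
                                   unsigned MinVecRegBits,
                                   unsigned ElemBits,
                                   bool LimitToRegSize) {
  std::vector<unsigned> VFs;
  if (ElemBits == 0)
    return VFs; // avoid dividing by zero
  const unsigned MaxVF = MaxVecRegBits / ElemBits;
  unsigned MinVF = MinVecRegBits / ElemBits;
  if (MinVF < 2)
    MinVF = 2; // a "vector" of one element is not vectorization
  for (unsigned VF = MaxVF; VF >= MinVF; VF /= 2) {
    VFs.push_back(VF);
    if (LimitToRegSize)
      break; // widest VF only
  }
  return VFs;
}
```

For example, with 32-bit elements, a 128-bit minimum and a 256-bit maximum register size, the unrestricted search tries VF 8 and then VF 4, while the restricted one tries only VF 8.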

@ABataev The pipeline already distinguishes between pre-link and post-link optimization pipelines, see e.g. the flag that gets passed to LoopRotate to control rotation of loops with calls (https://github.com/llvm/llvm-project/blob/2f69c82cec1ae05b4fdcef4ac48f48e9e2bad32b/llvm/lib/Passes/PassBuilder.cpp#L760). You'd probably want to do something similar here.

Though TBH I'm surprised that we perform vectorization in the pre-link pipelines at all, I'd have assumed that this only gets done in the LTO step.
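The suggestion to key the behaviour off the pre-link/post-link distinction could look roughly like the following. This is a self-contained sketch: the `Phase` enum, `SLPOptions` struct, and `slpOptionsFor` function are hypothetical stand-ins invented for illustration, not LLVM's actual types or pass constructor.

```cpp
// Hypothetical stand-in for a pipeline phase; not LLVM's own type.
enum class Phase {
  None,
  ThinLTOPreLink,
  ThinLTOPostLink,
  FullLTOPreLink,
  FullLTOPostLink
};

// True for the compile-time half of an LTO build.
inline bool isPreLink(Phase P) {
  return P == Phase::ThinLTOPreLink || P == Phase::FullLTOPreLink;
}

// Hypothetical options a vectorizer pass could accept from the pipeline.
struct SLPOptions {
  bool LimitToRegSize;
};

// Pre-link: restrict SLP to the widest VF. Post-link or non-LTO: full SLP.
inline SLPOptions slpOptionsFor(Phase P) {
  SLPOptions Opts;
  Opts.LimitToRegSize = isPreLink(P);
  return Opts;
}
```

Passing the decision in from the pipeline, rather than through a driver-level `-mllvm` flag, keeps the LTO-specific behaviour visible where the pipeline is built.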

In D108826#2969701, @nikic wrote:

@ABataev The pipeline already distinguishes between pre-link and post-link optimization pipelines, see e.g. the flag that gets passed to LoopRotate to control rotation of loops with calls (https://github.com/llvm/llvm-project/blob/2f69c82cec1ae05b4fdcef4ac48f48e9e2bad32b/llvm/lib/Passes/PassBuilder.cpp#L760). You'd probably want to do something similar here.

Probably, will check, thanks for the link.

Though TBH I'm surprised that we perform vectorization in the pre-link pipelines at all, I'd have assumed that this only gets done in the LTO step.

No, it also runs at compile time. Most probably, nobody has looked at it.

dtemirbulatov added a subscriber: dtemirbulatov. · Jun 16 2022, 1:12 AM

Herald added a project: Restricted Project. · Jun 16 2022, 1:12 AM

Herald added subscribers: vporpo, ormris, MaskRay.

Revision Contents

Path (Size)

clang/lib/Driver/ToolChains/Clang.cpp (7 lines)

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (11 lines)

Diff 369117

clang/lib/Driver/ToolChains/Clang.cpp

[...]
   if (Args.hasFlag(options::OPT_fvectorize, VectorizeAliasOption,
                    options::OPT_fno_vectorize, EnableVec))
     CmdArgs.push_back("-vectorize-loops");

   // -fslp-vectorize is enabled based on the optimization level selected.
   bool EnableSLPVec = shouldEnableVectorizerAtOLevel(Args, true);
   OptSpecifier SLPVectAliasOption =
       EnableSLPVec ? options::OPT_O_Group : options::OPT_fslp_vectorize;
   if (Args.hasFlag(options::OPT_fslp_vectorize, SLPVectAliasOption,
-                   options::OPT_fno_slp_vectorize, EnableSLPVec))
+                   options::OPT_fno_slp_vectorize, EnableSLPVec)) {
     CmdArgs.push_back("-vectorize-slp");
+    if (IsUsingLTO) {
+      CmdArgs.push_back("-mllvm");
+      CmdArgs.push_back("-slp-limit-to-reg-size");
+    }
+  }

   ParseMPreferVectorWidth(D, Args, CmdArgs);

   Args.AddLastArg(CmdArgs, options::OPT_fshow_overloads_EQ);
   Args.AddLastArg(CmdArgs,
                   options::OPT_fsanitize_undefined_strip_path_components_EQ);

   // -fdollars-in-identifiers default varies depending on platform and
[...]
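As a reading aid, the driver change above boils down to the following standalone sketch. The function `slpDriverArgs` and its parameters are hypothetical; only the three argument strings are taken from the diff.

```cpp
#include <string>
#include <vector>

// Toy model of the driver logic in the diff: when SLP vectorization is
// enabled and the compile is part of an LTO build, also forward the hidden
// backend option that restricts SLP to the maximal register size.
std::vector<std::string> slpDriverArgs(bool EnableSLP, bool UsingLTO) {
  std::vector<std::string> Args;
  if (EnableSLP) {
    Args.push_back("-vectorize-slp");
    if (UsingLTO) {
      Args.push_back("-mllvm");
      Args.push_back("-slp-limit-to-reg-size");
    }
  }
  return Args;
}
```

Note that in this design the restriction is decided by the driver, so it applies to the compile step of an LTO build only if the link step does not receive the same option.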

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

[...]
 // The Look-ahead heuristic goes through the users of the bundle to calculate
 // the users cost in getExternalUsesCost(). To avoid compilation time increase
 // we limit the number of users visited to this value.
 static cl::opt<unsigned> LookAheadUsersBudget(
     "slp-look-ahead-users-budget", cl::init(2), cl::Hidden,
     cl::desc("The maximum number of users to visit while visiting the "
              "predecessors. This prevents compilation time increase."));

+static cl::opt<bool> SLPLimitToRegSize(
+    "slp-limit-to-reg-size", cl::init(false), cl::Hidden,
+    cl::desc("Try to vectorize using only maximal vector register size."));
+
 static cl::opt<bool>
     ViewSLPTree("view-slp-tree", cl::Hidden,
                 cl::desc("Display the SLP trees with Graphviz"));

 // Limit the number of alias checks. The limit is chosen so that
 // it has no negative effect on the llvm benchmarks.
 static const unsigned AliasedCheckLimit = 10;

[...]
 bool SLPVectorizerPass::vectorizeStoreChain(ArrayRef<Value *> Chain, BoUpSLP &R,
                                             unsigned Idx) {
   LLVM_DEBUG(dbgs() << "SLP: Analyzing a store chain of length " << Chain.size()
                     << "\n");
   const unsigned Sz = R.getVectorElementSize(Chain[0]);
   const unsigned MinVF = R.getMinVecRegSize() / Sz;
   unsigned VF = Chain.size();

-  if (!isPowerOf2_32(Sz) || !isPowerOf2_32(VF) || VF < 2 || VF < MinVF)
+  if (!isPowerOf2_32(Sz) || !isPowerOf2_32(VF) || VF < 2 || VF < MinVF ||
+      (SLPLimitToRegSize && VF < R.getMaxVecRegSize() / Sz))
     return false;

   LLVM_DEBUG(dbgs() << "SLP: Analyzing " << VF << " stores at offset " << Idx
                     << "\n");

   R.buildTree(Chain);
   if (R.isTreeTinyAndNotFullyVectorizable())
     return false;
[...] (inside bool SLPVectorizerPass::tryToVectorizeList(ArrayRef<Value *> VL, BoUpSLP &R))
   }

   bool Changed = false;
   bool CandidateFound = false;
   InstructionCost MinCost = SLPCostThreshold.getValue();
   Type *ScalarTy = VL[0]->getType();
   if (auto *IE = dyn_cast<InsertElementInst>(VL[0]))
     ScalarTy = IE->getOperand(1)->getType();
+  unsigned MaxRegSz = R.getMaxVecRegSize() / Sz;

   unsigned NextInst = 0, MaxInst = VL.size();
   for (unsigned VF = MaxVF; NextInst + 1 < MaxInst && VF >= MinVF; VF /= 2) {
     // No actual vectorization should happen, if number of parts is the same as
     // provided vectorization factor (i.e. the scalar type is used for vector
     // code during codegen).
     auto *VecTy = FixedVectorType::get(ScalarTy, VF);
     if (TTI->getNumberOfParts(VecTy) == VF)
       continue;
     for (unsigned I = NextInst; I < MaxInst; ++I) {
       unsigned OpsWidth = 0;

       if (I + VF > MaxInst)
         OpsWidth = MaxInst - I;
       else
         OpsWidth = VF;

       if (!isPowerOf2_32(OpsWidth))
         continue;

-      if ((VF > MinVF && OpsWidth <= VF / 2) || (VF == MinVF && OpsWidth < 2))
+      if ((SLPLimitToRegSize && OpsWidth < MaxRegSz) ||
+          (VF > MinVF && OpsWidth <= VF / 2) || (VF == MinVF && OpsWidth < 2))
         break;

       ArrayRef<Value *> Ops = VL.slice(I, OpsWidth);
       // Check that a previous iteration of this loop did not delete the Value.
       if (llvm::any_of(Ops, [&R](Value *V) {
             auto *I = dyn_cast<Instruction>(V);
             return I && R.isDeleted(I);
           }))
[...]
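The effect of the extra condition in the `vectorizeStoreChain` guard can be checked with a small standalone model. Everything here is a sketch: `shouldTryStoreChain` and its parameters are hypothetical names, sizes are in bits, and the register sizes in the example are assumed values rather than those of any specific target.

```cpp
// Toy model of the store-chain guard; not LLVM code.
static bool isPowerOf2(unsigned X) { return X != 0 && (X & (X - 1)) == 0; }

// Returns true if a chain of ChainLen stores of ElemBits-wide elements
// passes the guard. With LimitToRegSize set, chains that do not fill the
// maximal vector register are rejected as well.
bool shouldTryStoreChain(unsigned ChainLen, unsigned ElemBits,
                         unsigned MinVecRegBits, unsigned MaxVecRegBits,
                         bool LimitToRegSize) {
  if (!isPowerOf2(ElemBits))
    return false; // also rejects 0, so the divisions below are safe
  const unsigned MinVF = MinVecRegBits / ElemBits;
  const unsigned VF = ChainLen;
  if (!isPowerOf2(VF) || VF < 2 || VF < MinVF)
    return false;
  if (LimitToRegSize && VF < MaxVecRegBits / ElemBits)
    return false;
  return true;
}
```

Assuming 32-bit elements with a 128-bit minimum and 256-bit maximum register size, a chain of 4 stores passes the original guard but is rejected once the limit is enabled, whereas a chain of 8 stores passes in both cases.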