This is an archive of the discontinued LLVM Phabricator instance.

Rework the LTO Pipeline, aligning closer to the O2/O3 pipeline.
Needs Review · Public

Authored by mehdi_amini on Oct 5 2015, 11:33 AM.

Details

Reviewers

Summary

It seems that the LTO pipeline hasn't received as much attention as the
regular O2/O3 pipeline over the last few years. This patch attempts to
refactor the pass pipeline initialization to align LTO with the
regular pipeline.

With the proposed change, the compile phase exits before the inliner
is run. The link phase will run the usual pipeline,
extended with some specific passes that leverage the augmented
knowledge we have from full program visibility.

The LTO addition over the regular pipeline consists mainly of re-running
the peephole passes a second time after the inliner is done, preceded
by a run of GlobalDCE/GlobalOpt.

Measurements on the public test suite as well as on our internal
suite show an overall net improvement, with some regressions that I
am still tracking. Consider this patch a work in progress that I
am submitting now for feedback.

Diff Detail

Event Timeline

mehdi_amini updated this revision to Diff 36537. Oct 5 2015, 11:33 AM

mehdi_amini retitled this revision from to Rework the LTO Pipeline, aligning closer to the O2/O3 pipeline..

mehdi_amini updated this object.

mehdi_amini added a reviewer: chandlerc.

mehdi_amini added subscribers: llvm-commits, dexonsmith.

Minor update, moving EliminateAvailableExternallyPass before GlobalOpt

tejohnson added a subscriber: tejohnson. Oct 5 2015, 12:16 PM

tejohnson added inline comments.

lib/Transforms/IPO/PassManagerBuilder.cpp
322	Why not perform the first round of inlining and some of the other optimizations (like the peepholes) called below when preparing for LTO? I would think this would result in smaller intermediate .o bitcode sizes and more efficient LTO.

I'm not sure why inlining should be able to end up with smaller files, except in the case where an "internal linkage" function can end up with no call site. Do you see any other cases?

The inliner may make different decisions with more of the call graph available. Assuming the inliner heuristics are well tuned, running on a partial call graph shouldn't give better results; on the contrary, it only limits the inliner's freedom during LTO.

If your only concern is about compile time, it is fairly easy to test and I can launch a run of our compile-time test-suite.
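
A minimal sketch of the "internal linkage" case mentioned above. The function names are hypothetical, and whether a given compiler actually inlines the call depends on its heuristics and optimization level.

```cpp
// square() has internal linkage: no other translation unit can call it.
static int square(int X) { return X * X; }

// The only call site of square(). If the inliner folds square() into
// area(), nothing references square() any more, so its body can be dropped
// from the emitted file. A function with external linkage would have to be
// kept, because another module might still call it.
int area(int Side) { return square(Side); }
```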

yaron.keren added a subscriber: yaron.keren. Oct 5 2015, 12:40 PM

Teresa (and others), can you run this on your internal benchmark suite to see the performance impact?

Unfortunately, we haven't had a lot of success running LTO on our
internal benchmarks in the past due to the scaling issue.

That's great it is giving you better performance - I was just
surprised that it was beneficial to skip all of those downstream
optimizations in the initial compile step. Is the better performance
coming from skipping the initial inline? Is there no benefit to doing
the peepholes etc in the initial compile?

Right now my view is that if I get a performance improvement by running the inliner and the "peephole" passes twice, then it is a bug. If it is not a bug, it means that the O3 pipeline is affected as well and we might want to run them twice there too. Does that make sense?

I ran the LLVM benchmark suite + some internals with a return before and after the inliner+peephole phase. Stopping before the inliner during the compile phase ends up with 13 regressions and 20 improvements, compared to running the inliner during the compile phase. I sent you some more details by email.

Hi Mehdi,

Thanks for sharing the results. As you note there are swings in both
directions, but the improvements outweigh the regressions.

tejohnson added inline comments. Oct 6 2015, 7:09 AM

include/llvm/Transforms/IPO/PassManagerBuilder.h
149	Can this be removed? I think it is dead code now.
lib/Transforms/IPO/PassManagerBuilder.cpp
581	This can be removed - it only needs to be called once and now the LTO pipeline calls it via addLTOOptimizationPasses.

mcrosier added a subscriber: mcrosier. Oct 6 2015, 7:34 AM

mehdi_amini added inline comments. Oct 6 2015, 9:21 AM

include/llvm/Transforms/IPO/PassManagerBuilder.h
149	Yes, the patch is not ready for review. I posted it for feedback on the approach as a work in progress, and hoping you (google) and others may have other benchmarks to run.

Rebase

Herald added a subscriber: mehdi_amini. Dec 14 2015, 10:57 AM

mehdi_amini mentioned this in D17115: Define the ThinLTO Pipeline. Feb 10 2016, 4:59 PM

Diffusion mentioned this in rL260604: Define the ThinLTO Pipeline. Feb 11 2016, 2:04 PM

Diffusion mentioned this in rL261029: Define the ThinLTO Pipeline (experimental). Feb 16 2016, 3:07 PM

davide added a subscriber: davide. Jan 16 2017, 10:53 AM

Revision Contents

Path	Size
include/llvm/Transforms/IPO/PassManagerBuilder.h	2 lines
lib/Transforms/IPO/PassManagerBuilder.cpp	164 lines

Diff 36537

include/llvm/Transforms/IPO/PassManagerBuilder.h

[121 lines hidden]	public:
bool LoopVectorize;		bool LoopVectorize;
bool RerollLoops;		bool RerollLoops;
bool LoadCombine;		bool LoadCombine;
bool DisableGVNLoadPRE;		bool DisableGVNLoadPRE;
bool VerifyInput;		bool VerifyInput;
bool VerifyOutput;		bool VerifyOutput;
bool MergeFunctions;		bool MergeFunctions;
bool PrepareForLTO;		bool PrepareForLTO;
		bool PerformLTO;

private:		private:
/// ExtensionList - This is list of all of the extensions that are registered.		/// ExtensionList - This is list of all of the extensions that are registered.
std::vector<std::pair<ExtensionPointTy, ExtensionFn> > Extensions;		std::vector<std::pair<ExtensionPointTy, ExtensionFn> > Extensions;

public:		public:
PassManagerBuilder();		PassManagerBuilder();
~PassManagerBuilder();		~PassManagerBuilder();
/// Adds an extension that will be used by all PassManagerBuilder instances.		/// Adds an extension that will be used by all PassManagerBuilder instances.
/// This is intended to be used by plugins, to register a set of		/// This is intended to be used by plugins, to register a set of
/// optimisations to run automatically.		/// optimisations to run automatically.
static void addGlobalExtension(ExtensionPointTy Ty, ExtensionFn Fn);		static void addGlobalExtension(ExtensionPointTy Ty, ExtensionFn Fn);
void addExtension(ExtensionPointTy Ty, ExtensionFn Fn);		void addExtension(ExtensionPointTy Ty, ExtensionFn Fn);

private:		private:
void addExtensionsToPM(ExtensionPointTy ETy,		void addExtensionsToPM(ExtensionPointTy ETy,
legacy::PassManagerBase &PM) const;		legacy::PassManagerBase &PM) const;
void addInitialAliasAnalysisPasses(legacy::PassManagerBase &PM) const;		void addInitialAliasAnalysisPasses(legacy::PassManagerBase &PM) const;
void addLTOOptimizationPasses(legacy::PassManagerBase &PM);		void addLTOOptimizationPasses(legacy::PassManagerBase &PM);
		tejohnson: Can this be removed? I think it is dead code now.
		mehdi_amini: Yes, the patch is not ready for review. I posted it for feedback on the approach as a work in progress, and hoping you (google) and others may have other benchmarks to run.
void addLateLTOOptimizationPasses(legacy::PassManagerBase &PM);		void addLateLTOOptimizationPasses(legacy::PassManagerBase &PM);
		void addPeepholePasses(legacy::PassManagerBase &MPM);

public:		public:
/// populateFunctionPassManager - This fills in the function pass manager,		/// populateFunctionPassManager - This fills in the function pass manager,
/// which is expected to be run on each function immediately as it is		/// which is expected to be run on each function immediately as it is
/// generated. The idea is to reduce the size of the IR in memory.		/// generated. The idea is to reduce the size of the IR in memory.
void populateFunctionPassManager(legacy::FunctionPassManager &FPM);		void populateFunctionPassManager(legacy::FunctionPassManager &FPM);

/// populateModulePassManager - This sets up the primary pass manager.		/// populateModulePassManager - This sets up the primary pass manager.
[17 lines hidden]

lib/Transforms/IPO/PassManagerBuilder.cpp

[110 lines hidden]	PassManagerBuilder::PassManagerBuilder() {
LoopVectorize = RunLoopVectorization;		LoopVectorize = RunLoopVectorization;
RerollLoops = RunLoopRerolling;		RerollLoops = RunLoopRerolling;
LoadCombine = RunLoadCombine;		LoadCombine = RunLoadCombine;
DisableGVNLoadPRE = false;		DisableGVNLoadPRE = false;
VerifyInput = false;		VerifyInput = false;
VerifyOutput = false;		VerifyOutput = false;
MergeFunctions = false;		MergeFunctions = false;
PrepareForLTO = false;		PrepareForLTO = false;
		PerformLTO = false;
}		}

PassManagerBuilder::~PassManagerBuilder() {		PassManagerBuilder::~PassManagerBuilder() {
delete LibraryInfo;		delete LibraryInfo;
delete Inliner;		delete Inliner;
}		}

/// Set of global extensions, automatically added as part of the standard set.		/// Set of global extensions, automatically added as part of the standard set.
[47 lines hidden]	void PassManagerBuilder::populateFunctionPassManager(
if (UseNewSROA)		if (UseNewSROA)
FPM.add(createSROAPass());		FPM.add(createSROAPass());
else		else
FPM.add(createScalarReplAggregatesPass());		FPM.add(createScalarReplAggregatesPass());
FPM.add(createEarlyCSEPass());		FPM.add(createEarlyCSEPass());
FPM.add(createLowerExpectIntrinsicPass());		FPM.add(createLowerExpectIntrinsicPass());
}		}

void PassManagerBuilder::populateModulePassManager(		void PassManagerBuilder::addPeepholePasses(legacy::PassManagerBase &MPM) {
legacy::PassManagerBase &MPM) {
// If all optimizations are disabled, just run the always-inline pass and,
// if enabled, the function merging pass.
if (OptLevel == 0) {
if (Inliner) {
MPM.add(Inliner);
Inliner = nullptr;
}

// FIXME: The BarrierNoopPass is a HACK! The inliner pass above implicitly
// creates a CGSCC pass manager, but we don't want to add extensions into
// that pass manager. To prevent this we insert a no-op module pass to reset
// the pass manager to get the same behavior as EP_OptimizerLast in non-O0
// builds. The function merging pass is
if (MergeFunctions)
MPM.add(createMergeFunctionsPass());
else if (!GlobalExtensions->empty() || !Extensions.empty())
MPM.add(createBarrierNoopPass());

addExtensionsToPM(EP_EnabledOnOptLevel0, MPM);
return;
}

// Add LibraryInfo if we have some.
if (LibraryInfo)
MPM.add(new TargetLibraryInfoWrapperPass(*LibraryInfo));

addInitialAliasAnalysisPasses(MPM);

if (!DisableUnitAtATime) {
addExtensionsToPM(EP_ModuleOptimizerEarly, MPM);

MPM.add(createIPSCCPPass()); // IP SCCP
MPM.add(createGlobalOptimizerPass()); // Optimize out global vars

MPM.add(createDeadArgEliminationPass()); // Dead argument elimination

MPM.add(createInstructionCombiningPass());// Clean up after IPCP & DAE
addExtensionsToPM(EP_Peephole, MPM);
MPM.add(createCFGSimplificationPass()); // Clean up after IPCP & DAE
}

if (EnableNonLTOGlobalsModRef)
// We add a module alias analysis pass here. In part due to bugs in the
// analysis infrastructure this "works" in that the analysis stays alive
// for the entire SCC pass run below.
MPM.add(createGlobalsAAWrapperPass());

// Start of CallGraph SCC passes.
if (!DisableUnitAtATime)
MPM.add(createPruneEHPass()); // Remove dead EH info
if (Inliner) {
MPM.add(Inliner);
Inliner = nullptr;
}
if (!DisableUnitAtATime)
MPM.add(createFunctionAttrsPass()); // Set readonly/readnone attrs
if (OptLevel > 2)
MPM.add(createArgumentPromotionPass()); // Scalarize uninlined fn args

// Start of function pass.		// Start of function pass.
// Break up aggregate allocas, using SSAUpdater.		// Break up aggregate allocas, using SSAUpdater.
if (UseNewSROA)		if (UseNewSROA)
MPM.add(createSROAPass());		MPM.add(createSROAPass());
else		else
MPM.add(createScalarReplAggregatesPass(-1, false));		MPM.add(createScalarReplAggregatesPass(-1, false));
MPM.add(createEarlyCSEPass()); // Catch trivial redundancies		MPM.add(createEarlyCSEPass()); // Catch trivial redundancies
[69 lines hidden]	void PassManagerBuilder::addPeepholePasses(legacy::PassManagerBase &MPM) {

if (LoadCombine)		if (LoadCombine)
MPM.add(createLoadCombinePass());		MPM.add(createLoadCombinePass());

MPM.add(createAggressiveDCEPass()); // Delete dead instructions		MPM.add(createAggressiveDCEPass()); // Delete dead instructions
MPM.add(createCFGSimplificationPass()); // Merge & remove BBs		MPM.add(createCFGSimplificationPass()); // Merge & remove BBs
MPM.add(createInstructionCombiningPass()); // Clean up after everything.		MPM.add(createInstructionCombiningPass()); // Clean up after everything.
addExtensionsToPM(EP_Peephole, MPM);		addExtensionsToPM(EP_Peephole, MPM);
		}

		void PassManagerBuilder::populateModulePassManager(
		legacy::PassManagerBase &MPM) {
		// If all optimizations are disabled, just run the always-inline pass and,
		// if enabled, the function merging pass.
		if (OptLevel == 0) {
		if (Inliner) {
		MPM.add(Inliner);
		Inliner = nullptr;
		}

		// FIXME: The BarrierNoopPass is a HACK! The inliner pass above implicitly
		// creates a CGSCC pass manager, but we don't want to add extensions into
		// that pass manager. To prevent this we insert a no-op module pass to reset
		// the pass manager to get the same behavior as EP_OptimizerLast in non-O0
		// builds. The function merging pass is
		if (MergeFunctions)
		MPM.add(createMergeFunctionsPass());
		else if (!GlobalExtensions->empty() || !Extensions.empty())
		MPM.add(createBarrierNoopPass());

		addExtensionsToPM(EP_EnabledOnOptLevel0, MPM);
		return;
		}

		// Add LibraryInfo if we have some.
		if (LibraryInfo)
		MPM.add(new TargetLibraryInfoWrapperPass(*LibraryInfo));

		addInitialAliasAnalysisPasses(MPM);

		if (!DisableUnitAtATime) {
		addExtensionsToPM(EP_ModuleOptimizerEarly, MPM);

		MPM.add(createIPSCCPPass()); // IP SCCP
		MPM.add(createGlobalOptimizerPass()); // Optimize out global vars

		if (PerformLTO)
		// Linking modules together can lead to duplicated global constants, only
		// keep one copy of each constant.
		MPM.add(createConstantMergePass());

		MPM.add(createDeadArgEliminationPass()); // Dead argument elimination

		MPM.add(createInstructionCombiningPass()); // Clean up after IPCP & DAE
		addExtensionsToPM(EP_Peephole, MPM);
		MPM.add(createCFGSimplificationPass()); // Clean up after IPCP & DAE
		}

		// If we are planning to perform LTO later, let's not bloat the code with
		// unrolling/vectorization/... now. We'll run the inliner + CGSCC passes
		// during LTO and performs the rest of the optimizations afterward.
		if (PrepareForLTO)
		tejohnson: Why not perform the first round of inlining and some of the other optimizations (like the peepholes) called below when preparing for LTO? I would think this would result in smaller intermediate .o bitcode sizes and more efficient LTO.
		return;

		if (EnableNonLTOGlobalsModRef)
		// We add a module alias analysis pass here. In part due to bugs in the
		// analysis infrastructure this "works" in that the analysis stays alive
		// for the entire SCC pass run below.
		MPM.add(createGlobalsAAWrapperPass());

		// Start of CallGraph SCC passes.
		if (PerformLTO || !DisableUnitAtATime)
		MPM.add(createPruneEHPass()); // Remove dead EH info
		if (Inliner) {
		MPM.add(Inliner);
		Inliner = nullptr;
		}
		if (!DisableUnitAtATime)
		MPM.add(createFunctionAttrsPass()); // Set readonly/readnone attrs
		if (PerformLTO || OptLevel > 2)
		MPM.add(createArgumentPromotionPass()); // Scalarize uninlined fn args

		addPeepholePasses(MPM);

// FIXME: This is a HACK! The inliner pass above implicitly creates a CGSCC		// FIXME: This is a HACK! The inliner pass above implicitly creates a CGSCC
// pass manager that we are specifically trying to avoid. To prevent this		// pass manager that we are specifically trying to avoid. To prevent this
// we must insert a no-op module pass to reset the pass manager.		// we must insert a no-op module pass to reset the pass manager.
MPM.add(createBarrierNoopPass());		MPM.add(createBarrierNoopPass());

if (!DisableUnitAtATime && OptLevel > 1 && !PrepareForLTO) {		if (PerformLTO) {
// Remove avail extern fns and globals definitions if we aren't		// Remove dead fns and globals. Removing unreferenced functions could lead
// compiling an object file for later LTO. For LTO we want to preserve		// to more opportunities for globalopt
// these so they are eligible for inlining at link-time. Note if they		MPM.add(createGlobalDCEPass());
// are unreferenced they will be removed by GlobalDCE later, so		MPM.add(createGlobalOptimizerPass());
// this only impacts referenced available externally globals.		// Remove dead fns and globals after globalopt
// Eventually they will be suppressed during codegen, but eliminating		MPM.add(createGlobalDCEPass());
// here enables more opportunity for GlobalDCE as it may make		addPeepholePasses(MPM);
// globals referenced by available external functions dead
// and saves running remaining passes on the eliminated functions.
MPM.add(createEliminateAvailableExternallyPass());
}		}

		MPM.add(createEliminateAvailableExternallyPass());

if (EnableNonLTOGlobalsModRef)		if (EnableNonLTOGlobalsModRef)
// We add a fresh GlobalsModRef run at this point. This is particularly		// We add a fresh GlobalsModRef run at this point. This is particularly
// useful as the above will have inlined, DCE'ed, and function-attr		// useful as the above will have inlined, DCE'ed, and function-attr
// propagated everything. We should at this point have a reasonably minimal		// propagated everything. We should at this point have a reasonably minimal
// and richly annotated call graph. By computing aliasing and mod/ref		// and richly annotated call graph. By computing aliasing and mod/ref
// information for all local globals here, the late loop passes and notably		// information for all local globals here, the late loop passes and notably
// the vectorizer will be able to use them to help recognize vectorizable		// the vectorizer will be able to use them to help recognize vectorizable
// memory operations.		// memory operations.
[203 lines hidden]
}		}

void PassManagerBuilder::addLateLTOOptimizationPasses(		void PassManagerBuilder::addLateLTOOptimizationPasses(
legacy::PassManagerBase &PM) {		legacy::PassManagerBase &PM) {
// Delete basic blocks, which optimization passes may have killed.		// Delete basic blocks, which optimization passes may have killed.
PM.add(createCFGSimplificationPass());		PM.add(createCFGSimplificationPass());

// Drop bodies of available externally objects to improve GlobalDCE.		// Drop bodies of available externally objects to improve GlobalDCE.
PM.add(createEliminateAvailableExternallyPass());		PM.add(createEliminateAvailableExternallyPass());
		tejohnson: This can be removed - it only needs to be called once and now the LTO pipeline calls it via addLTOOptimizationPasses.

// Now that we have optimized the program, discard unreachable functions.		// Now that we have optimized the program, discard unreachable functions.
PM.add(createGlobalDCEPass());		PM.add(createGlobalDCEPass());

// FIXME: this is profitable (for compiler time) to do at -O0 too, but		// FIXME: this is profitable (for compiler time) to do at -O0 too, but
// currently it damages debug info.		// currently it damages debug info.
if (MergeFunctions)		if (MergeFunctions)
PM.add(createMergeFunctionsPass());		PM.add(createMergeFunctionsPass());
}		}

void PassManagerBuilder::populateLTOPassManager(legacy::PassManagerBase &PM) {		void PassManagerBuilder::populateLTOPassManager(legacy::PassManagerBase &PM) {
if (LibraryInfo)		PerformLTO = true;
PM.add(new TargetLibraryInfoWrapperPass(*LibraryInfo));

if (VerifyInput)		if (VerifyInput)
PM.add(createVerifierPass());		PM.add(createVerifierPass());

if (OptLevel > 1)		if (OptLevel > 1)
addLTOOptimizationPasses(PM);		populateModulePassManager(PM);

// Lower bit sets to globals. This pass supports Clang's control flow		// Lower bit sets to globals. This pass supports Clang's control flow
// integrity mechanisms (-fsanitize=cfi*) and needs to run at link time if CFI		// integrity mechanisms (-fsanitize=cfi*) and needs to run at link time if CFI
// is enabled. The pass does nothing if CFI is disabled.		// is enabled. The pass does nothing if CFI is disabled.
PM.add(createLowerBitSetsPass());		PM.add(createLowerBitSetsPass());

if (OptLevel != 0)		if (OptLevel != 0)
addLateLTOOptimizationPasses(PM);		addLateLTOOptimizationPasses(PM);

if (VerifyOutput)		if (VerifyOutput)
PM.add(createVerifierPass());		PM.add(createVerifierPass());
		PerformLTO = false;
}		}

inline PassManagerBuilder *unwrap(LLVMPassManagerBuilderRef P) {		inline PassManagerBuilder *unwrap(LLVMPassManagerBuilderRef P) {
return reinterpret_cast<PassManagerBuilder*>(P);		return reinterpret_cast<PassManagerBuilder*>(P);
}		}

inline LLVMPassManagerBuilderRef wrap(PassManagerBuilder *P) {		inline LLVMPassManagerBuilderRef wrap(PassManagerBuilder *P) {
return reinterpret_cast<LLVMPassManagerBuilderRef>(P);		return reinterpret_cast<LLVMPassManagerBuilderRef>(P);
[83 lines hidden]