This is an archive of the discontinued LLVM Phabricator instance.


Differential D100802

[PassManager] add late LICM to LTO pipeline to undo InstCombine sinking
Changes Planned · Public

Authored by spatel on Apr 19 2021, 3:06 PM.

Details

Reviewers

craig.topper
nikic
RKSimon
xbolva00
asbirlea
MaskRay
dmgreen
lebedev.ri

Summary

This is an alternative to D87479 (where the proposed change was to InstCombine).
InstCombine sinks code without regard to cost/loops, so we don't want that to be near the final step in the opt pipeline. In the example test, an expensive fdiv gets hoisted back out of a loop once LICM runs after the final InstCombine.

I don't have any guess about the impact this has on compile-time, or if we can position the extra LICM somewhere else to make it better/cheaper. The regular (non-LTO) pipeline already has a late LICM, so we don't have this problem with plain -O? compiles.
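To make the fdiv example concrete, here is a minimal C++ sketch of the two loop shapes that the lto-licm.ll test distinguishes. The function names and the use of `std::vector` are illustrative assumptions and not part of the patch; rewriting `x / b` as `x * (1 / b)` is only valid under fast-math-style flags, as with the `fdiv fast` in the test.

```cpp
#include <cassert>
#include <vector>

// Loop shape when the division stays in the body (the state this patch wants
// to avoid at the end of the LTO pipeline): one divide per element.
void divide_in_loop(std::vector<float>& a, float b) {
  assert(b != 0.0f);
  for (float& x : a)
    x = x / b;
}

// Loop shape after a late LICM run, assuming fast-math permits the reciprocal
// rewrite: one divide before the loop, one multiply per element.
void divide_hoisted(std::vector<float>& a, float b) {
  assert(b != 0.0f);
  const float inv = 1.0f / b;  // loop-invariant, computed once
  for (float& x : a)
    x = x * inv;
}
```

The two functions agree exactly when `1 / b` is representable (for example, `b` a power of two) and may differ in the last bit otherwise, which is why the transform needs relaxed floating-point semantics.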

Event Timeline

spatel created this revision. Apr 19 2021, 3:06 PM

Herald added subscribers: asbirlea, steven_wu, hiraditya and 2 others. Apr 19 2021, 3:06 PM

spatel requested review of this revision. Apr 19 2021, 3:06 PM

Herald added a project: Restricted Project. Apr 19 2021, 3:06 PM

Harbormaster completed remote builds in B99577: Diff 338650. Apr 19 2021, 4:05 PM

Thanks.
I think this is the right path forward, but I think we could use some motivating perf numbers.

I've kicked off https://llvm-compile-time-tracker.com/compare.php?from=0a671c48950a4e9ff962208837169ea077e39b51&to=9a88c0161d92413616c268abcc6f6a1b23873449&stat=instructions

In D100802#2700532, @lebedev.ri wrote:

I've kicked off https://llvm-compile-time-tracker.com/compare.php?from=0a671c48950a4e9ff962208837169ea077e39b51&to=9a88c0161d92413616c268abcc6f6a1b23873449&stat=instructions

https://llvm-compile-time-tracker.com/compare.php?from=9430efa18b02e7a3f453793e48c96d5c954ed751&to=9a88c0161d92413616c268abcc6f6a1b23873449&stat=instructions

So it's within the noise, except for a +1% regression for full LTO.

llvm/lib/Passes/PassBuilder.cpp
1843–1847	Can you try something like this, ...
1848
llvm/test/Other/opt-LTO-pipeline.ll
181	This one is probably the most costly here.
189	..., i wonder if we could at least inline it into the `FunctionPass Manager` above

Patch updated:
Add explicit late LoopPassManager for LICM and disable MemorySSA usage from LICM pass on NewPM.

Hopefully, that matches what was suggested in the inline comment. Let me know if not. I should learn more about how the new PM works...
It's not clear to me what the analogous change would look like for OldPM - set the Cap options to zero?

lebedev.ri added inline comments. Apr 20 2021, 6:09 AM

llvm/test/Other/opt-LTO-pipeline.ll
179–190	The problem here is that we do this after the `FunctionPass Manager` finishes. This seems suboptimal from a cache locality perspective, and that is my only guess as to why we are seeing such a huge compile-time regression in full LTO only, because full LTO results in huge modules with lots of functions.

In D100802#2701348, @spatel wrote:

Patch updated:
Add explicit late LoopPassManager for LICM and disable MemorySSA usage from LICM pass on NewPM.

Hopefully, that matches what was suggested in the inline comment. Let me know if not. I should learn more about how the new PM works...
It's not clear to me what the analogous change would look like for OldPM - set the Cap options to zero?

That didn't help:
https://llvm-compile-time-tracker.com/compare.php?from=8a6772f3aa92fdf6d01303adfef0e05f5651ea7d&to=b39f226e628eb2b58e63848985bf5c2af8a1a8d7&stat=instructions

I personally think the problem isn't MSSA, but the fact that we run this Loop pass manager after the main function pass manager finishes, thus losing cache locality.
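The locality concern can be illustrated with a toy model of visit order. This is only a sketch: the types and function names below are hypothetical stand-ins, not LLVM's pass manager API, and the pass names are plain strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

using Names = std::vector<std::string>;

// Toy stand-in for one function pass manager: visit each function once and
// run every pass on it before moving on to the next function.
Names run_pass_manager(const Names& functions, const Names& passes) {
  Names trace;
  for (const std::string& f : functions)
    for (const std::string& p : passes)
      trace.push_back(p + "(" + f + ")");
  return trace;
}

// Several managers run back to back: each one sweeps the whole module again,
// so a function is revisited only after all other functions were touched.
Names run_back_to_back(const Names& functions,
                       const std::vector<Names>& managers) {
  Names trace;
  for (const Names& passes : managers) {
    Names part = run_pass_manager(functions, passes);
    trace.insert(trace.end(), part.begin(), part.end());
  }
  return trace;
}
```

With functions `f` and `g`, a single manager holding `JumpThreading` and `LICM` yields `JumpThreading(f), LICM(f), JumpThreading(g), LICM(g)`, while two separate managers yield `JumpThreading(f), JumpThreading(g), LICM(f), LICM(g)`. In a large full-LTO module the second order means each function's IR has likely left the cache by the time LICM reaches it.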

Err, after rereading this twice, this only touches LTO pipelines, doesn't it :]
Sorry,

Harbormaster completed remote builds in B99696: Diff 338832. Apr 20 2021, 6:50 AM

In D100802#2701501, @lebedev.ri wrote:

Err, after rereading this twice, this only touches LTO pipelines, doesn't it :]

Right - sorry that wasn't clear. I'm only trying to get closer to parity/consensus between the regular and LTO pipelines with this patch.

I don't know why the pipelines would diverge at this stage of the compile - there might be a good reason, or it just evolved to this state unintentionally?

So if we look at this section of the pipeline (from vectorization to the proposed LICM) in non-LTO vs. LTO mode, we have (table formatted for Phab):

| Regular | LTO | |
| --- | --- | --- |
| LoopVectorizePass | LoopVectorizePass | |
| LoopLoadEliminationPass | LoopUnrollPass | ??? |
| InstCombinePass | InstCombinePass | |
| SimplifyCFGPass | SimplifyCFGPass | |
| | SCCPPass | ??? |
| | InstCombinePass | ??? |
| | BDCEPass | ??? |
| SLPVectorizerPass | SLPVectorizerPass | |
| VectorCombinePass | VectorCombinePass | |
| InstCombinePass | InstCombinePass | |
| LoopUnrollPass | JumpThreadingPass | ??? |
| InstCombinePass | | ??? |
| LICMPass | LICMPass | |

Maybe we should reconcile the pass differences ahead of LICM in the pipeline before adding LICM?
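As a rough aid for that reconciliation, the two columns of the table can be compared mechanically. The sketch below is an assumption-laden illustration: it hard-codes the pass names from the table as strings, uses a hypothetical helper `only_in`, and ignores ordering and repeat counts, so it does not capture differences such as where LoopUnrollPass runs.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

using Pipeline = std::vector<std::string>;

// Pass names present in `a` but absent from `b`, ignoring order and repeats.
std::set<std::string> only_in(const Pipeline& a, const Pipeline& b) {
  const std::set<std::string> in_b(b.begin(), b.end());
  std::set<std::string> result;
  for (const std::string& pass : a)
    if (in_b.count(pass) == 0)
      result.insert(pass);
  return result;
}

// The two columns of the table, read top to bottom.
const Pipeline regular_pipeline = {
    "LoopVectorizePass", "LoopLoadEliminationPass", "InstCombinePass",
    "SimplifyCFGPass",   "SLPVectorizerPass",       "VectorCombinePass",
    "InstCombinePass",   "LoopUnrollPass",          "InstCombinePass",
    "LICMPass"};

const Pipeline lto_pipeline = {
    "LoopVectorizePass", "LoopUnrollPass",    "InstCombinePass",
    "SimplifyCFGPass",   "SCCPPass",          "InstCombinePass",
    "BDCEPass",          "SLPVectorizerPass", "VectorCombinePass",
    "InstCombinePass",   "JumpThreadingPass", "LICMPass"};
```

Under these assumptions, `only_in(regular_pipeline, lto_pipeline)` contains only LoopLoadEliminationPass, and `only_in(lto_pipeline, regular_pipeline)` contains BDCEPass, JumpThreadingPass, and SCCPPass.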

The change you made for the NewPM doesn't just disable MSSA -- it enables AST instead, which is even worse :) What you'd actually want for this case is an additional option that disables the memory optimizations in LICM entirely and uses neither MSSA nor AST. That should make this change much cheaper, as MSSA/AST should be the primary costs here.

In D100802#2702510, @nikic wrote:

The change you made for the NewPM doesn't just disable MSSA -- it enables AST instead, which is even worse :)

Oops!

What you'd actually want for this case is an additional option that disables the memory optimizations in LICM entirely and uses neither MSSA nor AST. That should make this change much cheaper, as MSSA/AST should be the primary costs here.

Ok - let me step back and ask: any ideas about what is an acceptable compile-time cost here?
It seems like we might be better off unifying the behavior between LTO and non-LTO (see table in previous comment) - potentially more compile-time savings?

I wasn't actually suggesting to disable MSSA; I was commenting more on doing this after the FPM finishes.
Indeed, I think we should just match the behavior of the normal optimization pipeline, which probably means keeping MSSA.

~1% sounds about reasonable to me, I guess.

Matt added a subscriber: Matt. May 5 2021, 7:58 AM

spatel mentioned this in D102002: [PassManager] unify vector passes between regular and LTO pipelines. May 6 2021, 9:53 AM

Taking this off the active queue for now. D102002 is the more general unification patch, but that's going to have to happen in pieces to avoid a slow and endless chain of difficult regressions.

This review seems to be stuck/dead, consider abandoning if no longer relevant.

Herald added a project: Restricted Project. Jan 12 2023, 4:50 PM

Herald added subscribers: ormris, StephenFan.

Revision Contents

| Path | Size |
| --- | --- |
| llvm/lib/Passes/PassBuilder.cpp | 6 lines |
| llvm/lib/Transforms/IPO/PassManagerBuilder.cpp | 4 lines |
| llvm/test/Other/new-pm-lto-defaults.ll | 5 lines |
| llvm/test/Other/opt-LTO-pipeline.ll | 12 lines |
| llvm/test/Transforms/PhaseOrdering/lto-licm.ll | 7 lines |

Diff 338650

llvm/lib/Passes/PassBuilder.cpp

PassBuilder::buildLTODefaultPipeline(OptimizationLevel Level,
  MainFPM.addPass(AlignmentFromAssumptionsPass());
  // FIXME: Conditionally run LoadCombine here, after it's ported
  // (in case we still have this pass, given its questionable usefulness).
  MainFPM.addPass(InstCombinePass());
  invokePeepholeEPCallbacks(MainFPM, Level);
  MainFPM.addPass(JumpThreadingPass(/*InsertFreezeWhenUnfoldingSelect*/ true));
+ // LICM should always be run after the final InstCombine because InstCombine
+ // sinks instructions without regard to loop-invariance.
+ MainFPM.addPass(createFunctionToLoopPassAdaptor(
+     LICMPass(PTO.LicmMssaOptCap, PTO.LicmMssaNoAccForPromotionCap),
+     EnableMSSALoopDependency, /*UseBlockFrequencyInfo=*/true, DebugLogging));

lebedev.ri (inline comment, marked Done): Can you try something like this, ...

  MainFPM.addPass(JumpThreadingPass(/*InsertFreezeWhenUnfoldingSelect*/ true));
+ LoopPassManager LateLPM(DebugLogging);
  // LICM should always be run after the final InstCombine because InstCombine
  // sinks instructions without regard to loop-invariance.
+ LateLPM.addPass(LICMPass(PTO.LicmMssaOptCap, PTO.LicmMssaNoAccForPromotionCap));
  MainFPM.addPass(createFunctionToLoopPassAdaptor(
-     LICMPass(PTO.LicmMssaOptCap, PTO.LicmMssaNoAccForPromotionCap),
-     EnableMSSALoopDependency, /*UseBlockFrequencyInfo=*/true, DebugLogging));
+     std::move(LateLPM), /*UseMemorySSA=*/false, /*UseBlockFrequencyInfo=*/true,
+     DebugLogging));
  MPM.addPass(createModuleToFunctionPassAdaptor(std::move(MainFPM)));

  MPM.addPass(createModuleToFunctionPassAdaptor(std::move(MainFPM)));
  // Create a function that performs CFI checks for cross-DSO calls with
  // targets in the current module.
  MPM.addPass(CrossDSOCFIPass());
  // Lower type metadata and the type.test intrinsic. This pass supports
  // clang's control flow integrity mechanisms (-fsanitize=cfi*) and needs
  // to be run at link time if CFI is enabled. This pass does nothing if

llvm/lib/Transforms/IPO/PassManagerBuilder.cpp

  if (OptLevel != 0)
    addLTOOptimizationPasses(PM);
  else {
    // The whole-program-devirt pass needs to run at -O0 because only it knows
    // about the llvm.type.checked.load intrinsic: it needs to both lower the
    // intrinsic itself and handle it in the summary.
    PM.add(createWholeProgramDevirtPass(ExportSummary, nullptr));
  }

+ // LICM should always be run after the final InstCombine because InstCombine
+ // sinks instructions without regard to loop-invariance.
+ PM.add(createLICMPass(LicmMssaOptCap, LicmMssaNoAccForPromotionCap));
+
  // Create a function that performs CFI checks for cross-DSO calls with targets
  // in the current module.
  PM.add(createCrossDSOCFIPass());

  // Lower type metadata and the type.test intrinsic. This pass supports Clang's
  // control flow integrity mechanisms (-fsanitize=cfi*) and needs to run at
  // link time if CFI is enabled. The pass does nothing if CFI is disabled.
  PM.add(createLowerTypeTestsPass(ExportSummary, nullptr));

llvm/test/Other/new-pm-lto-defaults.ll

  ; CHECK-O2-NEXT: Running pass: SLPVectorizerPass on foo
  ; CHECK-O3-NEXT: Running pass: SLPVectorizerPass on foo
  ; CHECK-OS-NEXT: Running pass: SLPVectorizerPass on foo
  ; CHECK-O23SZ-NEXT: Running pass: VectorCombinePass on foo
  ; CHECK-O23SZ-NEXT: Running pass: AlignmentFromAssumptionsPass on foo
  ; CHECK-O23SZ-NEXT: Running pass: InstCombinePass on foo
  ; CHECK-EP-Peephole-NEXT: Running pass: NoOpFunctionPass on foo
  ; CHECK-O23SZ-NEXT: Running pass: JumpThreadingPass on foo
+ ; CHECK-O23SZ-NEXT: Starting llvm::Function pass manager run.
+ ; CHECK-O23SZ-NEXT: Running pass: LoopSimplifyPass on foo
+ ; CHECK-O23SZ-NEXT: Running pass: LCSSAPass on foo
+ ; CHECK-O23SZ-NEXT: Finished llvm::Function pass manager run.
+ ; CHECK-O23SZ-NEXT: Running pass: LICMPass on Loop
  ; CHECK-O23SZ-NEXT: Running pass: CrossDSOCFIPass
  ; CHECK-O23SZ-NEXT: Running pass: LowerTypeTestsPass
  ; CHECK-O-NEXT: Running pass: LowerTypeTestsPass
  ; CHECK-O23SZ-NEXT: Running pass: SimplifyCFGPass
  ; CHECK-O23SZ-NEXT: Running pass: EliminateAvailableExternallyPass
  ; CHECK-O23SZ-NEXT: Running pass: GlobalDCEPass
  ; CHECK-O-NEXT: Running pass: AnnotationRemarksPass on foo
  ; CHECK-O-NEXT: Running pass: PrintModulePass

llvm/test/Other/opt-LTO-pipeline.ll

  ; CHECK-NEXT: Optimize scalar/vector ops
  ; CHECK-NEXT: Scalar Evolution Analysis
  ; CHECK-NEXT: Alignment from assumptions
  ; CHECK-NEXT: Function Alias Analysis Results
  ; CHECK-NEXT: Optimization Remark Emitter
  ; CHECK-NEXT: Combine redundant instructions
  ; CHECK-NEXT: Lazy Value Information Analysis
  ; CHECK-NEXT: Jump Threading
+ ; CHECK-NEXT: Basic Alias Analysis (stateless AA impl)
+ ; CHECK-NEXT: Function Alias Analysis Results
+ ; CHECK-NEXT: Memory SSA
lebedev.ri (inline comment, Not Done): This one is probably the most costly here.
+ ; CHECK-NEXT: Natural Loop Information
+ ; CHECK-NEXT: Canonicalize natural loops
+ ; CHECK-NEXT: LCSSA Verifier
+ ; CHECK-NEXT: Loop-Closed SSA Form Pass
+ ; CHECK-NEXT: Scalar Evolution Analysis
+ ; CHECK-NEXT: Lazy Branch Probability Analysis
+ ; CHECK-NEXT: Lazy Block Frequency Analysis
+ ; CHECK-NEXT: Loop Pass Manager
lebedev.ri (inline comment, Not Done): ..., i wonder if we could at least inline it into the `FunctionPass Manager` above
+ ; CHECK-NEXT: Loop Invariant Code Motion
lebedev.ri (inline comment, Not Done): The problem here is that we do this after the `FunctionPass Manager` finishes. This seems suboptimal from cache locality perspective, and that is my only guess as to why we are seeing such huge compile-time regression in full LTO only, because full lto results in huge modules with lots of functions.
  ; CHECK-NEXT: Cross-DSO CFI
  ; CHECK-NEXT: Lower type metadata
  ; CHECK-NEXT: Lower type metadata
  ; CHECK-NEXT: FunctionPass Manager
  ; CHECK-NEXT: Simplify the CFG
  ; CHECK-NEXT: Eliminate Available Externally Globals
  ; CHECK-NEXT: Dead Global Elimination
  ; CHECK-NEXT: FunctionPass Manager

llvm/test/Transforms/PhaseOrdering/lto-licm.ll

  ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
  ; RUN: opt -enable-new-pm=0 -std-link-opts -S < %s | FileCheck %s
  ; RUN: opt -passes='lto<O3>' -S < %s | FileCheck %s

  define void @hoist_fdiv(float* %a, float %b) {
  ; CHECK-LABEL: @hoist_fdiv(
  ; CHECK-NEXT: entry:
+ ; CHECK-NEXT: [[TMP0:%.*]] = fdiv fast float 1.000000e+00, [[B:%.*]]
  ; CHECK-NEXT: br label [[FOR_COND:%.*]]
  ; CHECK: for.cond:
  ; CHECK-NEXT: [[I_0:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[INC:%.*]], [[FOR_INC:%.*]] ]
  ; CHECK-NEXT: [[CMP_NOT:%.*]] = icmp eq i32 [[I_0]], 1024
  ; CHECK-NEXT: br i1 [[CMP_NOT]], label [[FOR_END:%.*]], label [[FOR_INC]]
  ; CHECK: for.inc:
  ; CHECK-NEXT: [[IDXPROM:%.*]] = zext i32 [[I_0]] to i64
  ; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[A:%.*]], i64 [[IDXPROM]]
- ; CHECK-NEXT: [[TMP0:%.*]] = load float, float* [[ARRAYIDX]], align 4
- ; CHECK-NEXT: [[TMP1:%.*]] = fdiv fast float [[TMP0]], [[B:%.*]]
- ; CHECK-NEXT: store float [[TMP1]], float* [[ARRAYIDX]], align 4
+ ; CHECK-NEXT: [[TMP1:%.*]] = load float, float* [[ARRAYIDX]], align 4
+ ; CHECK-NEXT: [[TMP2:%.*]] = fmul fast float [[TMP1]], [[TMP0]]
+ ; CHECK-NEXT: store float [[TMP2]], float* [[ARRAYIDX]], align 4
  ; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I_0]], 1
  ; CHECK-NEXT: br label [[FOR_COND]]
  ; CHECK: for.end:
  ; CHECK-NEXT: ret void
  ;
  entry:
  br label %for.cond
