This is an archive of the discontinued LLVM Phabricator instance.

[TailCallElim] Add tailcall elimination pass to LTO Pipelines
Closed · Public

Authored by rob.lougher on Feb 19 2019, 10:40 AM.

Details

Reviewers

hfinkel
rnk
chandlerc
efriedma
tejohnson
mehdi_amini

Commits

rGde548ccab9f1: [TailCallElim] Add tailcall elimination pass to LTO pipelines
rL356511: [TailCallElim] Add tailcall elimination pass to LTO pipelines

Summary

If the following simple program is compiled with LTO, the call to foobar() will not be tailcall optimized. This is because the tailcall elimination pass is only run in the initial compilation step, so the results of link-time inlining are not visible to it.

------------ 1.c ----------------
extern void foobar(void);
extern void bar(int *);

void foo() {
  int a[10];
  bar(a);
  foobar();
}
--------------------------------
------------ 2.c ----------------
void bar(int *p) {
  *p = 10;
}
--------------------------------

$ clang -flto 1.c 2.c -c -O2
$ llvm-lto 1.o 2.o --exported-symbol=foo -save-merged-module -o 3.o
$ llvm-dis 3.o.merged.bc -o -

...
; Function Attrs: nounwind uwtable
define dso_local void @foo() local_unnamed_addr #0 {
entry:
  call void @foobar() #2
  ret void
}
...
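
As a rough single-file illustration of the difference (a sketch only, not part of the patch): foo_separate() mirrors 1.c, where the address of a is handed to bar(), and foo_inlined() is a hand-written approximation of what foo() looks like once bar() has been inlined at link time. The names foo_separate, foo_inlined, run_both and the foobar_calls counter are hypothetical and exist only for this example.

```c
#include <assert.h>

/* Hypothetical counter so the example has something observable. */
int foobar_calls = 0;

void foobar(void) { foobar_calls++; }

/* Same body as bar() in 2.c. */
void bar(int *p) { *p = 10; }

/* Mirrors foo() in 1.c: the address of `a` is passed to bar(). When bar()
   is not visible, the compiler has to assume the pointer may be captured,
   so foobar() might still reach foo's stack frame. */
void foo_separate(void) {
  int a[10];
  bar(a);
  foobar();
}

/* Approximation of foo() after bar() is inlined: no pointer to `a` leaves
   the function and the store is dead, so only the call to foobar() matters. */
void foo_inlined(void) {
  int a[10];
  a[0] = 10;
  foobar();
}

/* Runs both variants and checks that each calls foobar() exactly once. */
int run_both(void) {
  foobar_calls = 0;
  foo_separate();
  assert(foobar_calls == 1);
  foo_inlined();
  assert(foobar_calls == 2);
  return foobar_calls;
}
```

Both variants behave identically at run time; the difference is only in what the optimizer can prove about the array. The second form is consistent with the merged module above, where foo() contains nothing but the call to foobar().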

Even without link-time inlining, LTO may be able to perform additional tailcall optimization due to the visibility of the nocapture attribute. For example, if the program above is modified to make bar() noinline, foobar() can still be tailcalled as the parameter to bar() is marked nocapture:

; Function Attrs: noinline norecurse nounwind uwtable writeonly
define internal fastcc void @bar(i32* nocapture %p) unnamed_addr #3 {
entry:
  store i32 10, i32* %p, align 4, !tbaa !4
  ret void
}

(Before D53519, this case would not have been optimized due to the lifetime markers.)

Diff Detail

Repository: rL LLVM

Event Timeline

rob.lougher created this revision. Feb 19 2019, 10:40 AM

Herald added a project: Restricted Project. Feb 19 2019, 10:40 AM

Herald added subscribers: jdoerfert, dexonsmith, steven_wu and 3 others.

What's the general state of the LTO pipeline at this point? PassManagerBuilder::addLTOOptimizationPasses is adding passes in a really weird order (in particular, the placement of the inliner is really strange). Has anyone looked at it recently? Would it be worth killing it off in favor of the ThinLTO pipeline just so we don't have to maintain it?

In D58391#1403188, @efriedma wrote:

What's the general state of the LTO pipeline at this point? PassManagerBuilder::addLTOOptimizationPasses is adding passes in a really weird order (in particular, the placement of the inliner is really strange). Has anyone looked at it recently? Would it be worth killing it off in favor of the ThinLTO pipeline just so we don't have to maintain it?

The reason they don't use the same pipeline is that it would be too expensive for regular LTO mode which is serial. ThinLTO can use a more aggressive post-link pipeline due to the parallelism. Mehdi did some measurements a couple years ago and found that using the ThinLTO pipeline for regular LTO increases the compile time in that mode significantly.

Regarding this patch, the change seems reasonable but I suppose any new addition to the regular LTO pipeline needs to consider the compile time vs performance tradeoff. If this patch is not too expensive then IMO it is fine to add.

In D58391#1408322, @tejohnson wrote:

In D58391#1403188, @efriedma wrote:

What's the general state of the LTO pipeline at this point? PassManagerBuilder::addLTOOptimizationPasses is adding passes in a really weird order (in particular, the placement of the inliner is really strange). Has anyone looked at it recently? Would it be worth killing it off in favor of the ThinLTO pipeline just so we don't have to maintain it?

The reason they don't use the same pipeline is that it would be too expensive for regular LTO mode which is serial. ThinLTO can use a more aggressive post-link pipeline due to the parallelism. Mehdi did some measurements a couple years ago and found that using the ThinLTO pipeline for regular LTO increases the compile time in that mode significantly.

It doubled the link time of clang if I remember correctly.

dmgreen added a subscriber: dmgreen. Feb 27 2019, 10:40 AM

In D58391#1408322, @tejohnson wrote:

Regarding this patch, the change seems reasonable but I suppose any new addition to the regular LTO pipeline needs to consider the compile time vs performance tradeoff. If this patch is not too expensive then IMO it is fine to add.

Sorry for the delay, it has taken some time to get compile times. I have built two very large internal codebases 10 times each with and without the pass in the LTO pipeline. The mean compile time difference (with - without as percentage of without) was:

Codebase1: -0.03%
Codebase2: -0.56%

The results are negative, because the compile time was very slightly faster with the pass. However, the values are too small to be significant. The standard deviation of the compile times was (given as a percentage of the mean):

Codebase1: without: 0.40% with: 0.41%
Codebase2: without: 0.33% with: 0.49%

As the results were not what I was expecting, I checked everything and repeated the builds. This confirmed the first results (mean compile time very slightly faster with the pass).
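
The two statistics quoted above can be computed along the lines of the following sketch. The timing arrays are made-up values for illustration, not the measurements from the internal codebases. The function names are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the comment does not say which estimator was used.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n build times. */
double mean(const double *t, size_t n) {
  assert(n > 0);
  double sum = 0.0;
  for (size_t i = 0; i < n; ++i)
    sum += t[i];
  return sum / (double)n;
}

/* Square root by Newton's iteration, so the sketch links without libm. */
double sqrt_newton(double x) {
  if (x <= 0.0)
    return 0.0;
  double guess = x > 1.0 ? x : 1.0;
  for (int i = 0; i < 60; ++i)
    guess = 0.5 * (guess + x / guess);
  return guess;
}

/* Sample standard deviation, expressed as a percentage of the mean. */
double stddev_pct(const double *t, size_t n) {
  assert(n > 1);
  double m = mean(t, n);
  double sq = 0.0;
  for (size_t i = 0; i < n; ++i)
    sq += (t[i] - m) * (t[i] - m);
  return 100.0 * sqrt_newton(sq / (double)(n - 1)) / m;
}

/* (with - without) as a percentage of without, using the two means. */
double mean_diff_pct(const double *with, const double *without, size_t n) {
  double base = mean(without, n);
  return 100.0 * (mean(with, n) - base) / base;
}

/* Made-up build times in seconds, for illustration only. */
const double without_pass[] = {100.0, 101.0, 99.0, 100.0};
const double with_pass[] = {99.5, 100.5, 99.0, 100.0};
```

With these made-up numbers, mean_diff_pct(with_pass, without_pass, 4) is -0.25, i.e. the builds with the pass are 0.25% faster on average, while stddev_pct(without_pass, 4) is roughly 0.82, so the difference sits well inside the run-to-run spread.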

Ping.

Sorry missed your earlier update. LGTM. Thanks for doing the measurements!

This revision is now accepted and ready to land. Mar 19 2019, 11:03 AM

Closed by commit rL356511: [TailCallElim] Add tailcall elimination pass to LTO pipelines (authored by rlougher). Mar 19 2019, 1:23 PM

This revision was automatically updated to reflect the committed changes.

In D58391#1435174, @tejohnson wrote:

Sorry missed your earlier update. LGTM. Thanks for doing the measurements!

Thanks for the review! I've had to revert it as it was causing some LLD failures. I'll update the tests and resubmit tomorrow.

rob.lougher mentioned this in D59604: [LLD] Update tests for LTO pipeline change. Mar 20 2019, 10:10 AM

Diffusion mentioned this in rL356593: [TailCallElim] Update tests for LTO pipeline change. Mar 20 2019, 12:04 PM

Diffusion mentioned this in rLLD356593: [TailCallElim] Update tests for LTO pipeline change.

rob.lougher mentioned this in rG364cb6b5d70a: [TailCallElim] Update tests for LTO pipeline change. Mar 20 2019, 12:04 PM

Revision Contents

Path                                                    Size
llvm/trunk/lib/Passes/PassBuilder.cpp                   4 lines
llvm/trunk/lib/Transforms/IPO/PassManagerBuilder.cpp    4 lines
llvm/trunk/test/LTO/X86/tailcallelim.ll                 22 lines
llvm/trunk/test/Other/new-pm-lto-defaults.ll            1 line

Diff 191381

llvm/trunk/lib/Passes/PassBuilder.cpp

...
     else if (PGOOpt->CSAction == PGOOptions::CSIRUse)
       addPGOInstrPasses(MPM, DebugLogging, Level, /* RunProfileGen */ false,
                         /* IsCS */ true, PGOOpt->ProfileFile,
                         PGOOpt->ProfileRemappingFile);
   }

   // Break up allocas
   FPM.addPass(SROA());

+  // LTO provides additional opportunities for tailcall elimination due to
+  // link-time inlining, and visibility of nocapture attribute.
+  FPM.addPass(TailCallElimPass());
+
   // Run a few AA driver optimizations here and now to cleanup the code.
   MPM.addPass(createModuleToFunctionPassAdaptor(std::move(FPM)));

   MPM.addPass(createModuleToPostOrderCGSCCPassAdaptor(
       PostOrderFunctionAttrsPass()));
   // FIXME: here we run IP alias analysis in the legacy PM.

   FunctionPassManager MainFPM;
...

llvm/trunk/lib/Transforms/IPO/PassManagerBuilder.cpp

...
 void PassManagerBuilder::addLTOOptimizationPasses(legacy::PassManagerBase &PM) {
   // The IPO passes may leave cruft around. Clean up after them.
   addInstructionCombiningPass(PM);
   addExtensionsToPM(EP_Peephole, PM);
   PM.add(createJumpThreadingPass());

   // Break up allocas
   PM.add(createSROAPass());

+  // LTO provides additional opportunities for tailcall elimination due to
+  // link-time inlining, and visibility of nocapture attribute.
+  PM.add(createTailCallEliminationPass());
+
   // Run a few AA driven optimizations here and now, to cleanup the code.
   PM.add(createPostOrderFunctionAttrsLegacyPass()); // Add nocapture.
   PM.add(createGlobalsAAWrapperPass()); // IP alias analysis.

   PM.add(createLICMPass()); // Hoist loop invariants.
   PM.add(createMergedLoadStoreMotionPass()); // Merge ld/st in diamonds.
   PM.add(NewGVN ? createNewGVNPass()
                 : createGVNPass(DisableGVNLoadPRE)); // Remove redundancies.
...

llvm/trunk/test/LTO/X86/tailcallelim.ll

+; Check that the LTO pipelines add the Tail Call Elimination pass.
+
+; RUN: llvm-as < %s > %t1
+; RUN: llvm-lto -o %t2 %t1 --exported-symbol=foo -save-merged-module
+; RUN: llvm-dis < %t2.merged.bc | FileCheck %s
+
+; RUN: llvm-lto2 run -r %t1,foo,plx -r %t1,bar,plx -o %t3 %t1 -save-temps
+; RUN: llvm-dis < %t3.0.4.opt.bc | FileCheck %s
+
+; RUN: llvm-lto2 run -r %t1,foo,plx -r %t1,bar,plx -o %t4 %t1 -save-temps -use-new-pm
+; RUN: llvm-dis < %t4.0.4.opt.bc | FileCheck %s
+
+target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
+target triple = "x86_64-unknown-linux-gnu"
+
+define void @foo() {
+; CHECK: tail call void @bar()
+  call void @bar()
+  ret void
+}
+
+declare void @bar()

llvm/trunk/test/Other/new-pm-lto-defaults.ll

...
 ; CHECK-O2-NEXT: Running pass: GlobalDCEPass
 ; CHECK-O2-NEXT: Running pass: ModuleToFunctionPassAdaptor<{{.*}}PassManager{{.*}}>
 ; CHECK-O2-NEXT: Starting llvm::Function pass manager run.
 ; CHECK-O2-NEXT: Running pass: InstCombinePass
 ; CHECK-EP-Peephole-NEXT: Running pass: NoOpFunctionPass
 ; CHECK-O2-NEXT: Running pass: JumpThreadingPass
 ; CHECK-O2-NEXT: Running analysis: LazyValueAnalysis
 ; CHECK-O2-NEXT: Running pass: SROA on foo
+; CHECK-O2-NEXT: Running pass: TailCallElimPass on foo
 ; CHECK-O2-NEXT: Finished llvm::Function pass manager run.
 ; CHECK-O2-NEXT: Running pass: ModuleToPostOrderCGSCCPassAdaptor<{{.*}}PostOrderFunctionAttrsPass>
 ; CHECK-O2-NEXT: Running pass: ModuleToFunctionPassAdaptor<{{.*}}PassManager{{.*}}>
 ; CHECK-O2-NEXT: Running analysis: MemoryDependenceAnalysis
 ; CHECK-O2-NEXT: Running analysis: PhiValuesAnalysis
 ; CHECK-O2-NEXT: Running analysis: DemandedBitsAnalysis
 ; CHECK-O2-NEXT: Running pass: CrossDSOCFIPass
 ; CHECK-O2-NEXT: Running pass: LowerTypeTestsPass
...