This is an archive of the discontinued LLVM Phabricator instance.

[TailCallElim] Add tailcall elimination pass to LTO Pipelines
ClosedPublic

Authored by rob.lougher on Feb 19 2019, 10:40 AM.

Details

Summary

If the following simple program is compiled with LTO the call to foobar() will not be tailcall optimized. This is because the tailcall elimination pass is only ran in the initial compilation step. This means link-time inlining is not visible to it.

------------ 1.c ----------------
extern void foobar(void);
extern void bar(int *);

void foo() {
  int a[10];
  bar(a);
  foobar();
}
--------------------------------
------------ 2.c ----------------
void bar(int *p) {
  *p = 10;
}
--------------------------------

$ clang -flto 1.c 2.c -c -O2
$ llvm-lto 1.o 2.o --exported-symbol=foo -save-merged-module -o 3.o
$ llvm-dis 3.o.merged.bc -o -

...
; Function Attrs: nounwind uwtable
define dso_local void @foo() local_unnamed_addr #0 {
entry:
  call void @foobar() #2
  ret void
}
...

Even without link-time inlining, LTO may be able to perform additional tailcall optimization due to the visibility of the nocapture attribute. For example, if the program above is modified to make bar() noinline, foobar() can still be tailcalled as the parameter to bar() is marked nocapture:

; Function Attrs: noinline norecurse nounwind uwtable writeonly
define internal fastcc void @bar(i32* nocapture %p) unnamed_addr #3 {
entry:
  store i32 10, i32* %p, align 4, !tbaa !4
  ret void
}

(Before D53519, this case would not have been optimized due to the lifetime markers.)

Diff Detail

Repository
rL LLVM

Event Timeline

rob.lougher created this revision.Feb 19 2019, 10:40 AM
efriedma added a subscriber: efriedma.

What's the general state of the LTO pipeline at this point? PassManagerBuilder::addLTOOptimizationPasses is adding passes in a really weird order (in particular, the placement of the inliner is really strange). Has anyone looked at it recently? Would it be worth killing it off in favor of the ThinLTO pipeline just so we don't have to maintain it?

What's the general state of the LTO pipeline at this point? PassManagerBuilder::addLTOOptimizationPasses is adding passes in a really weird order (in particular, the placement of the inliner is really strange). Has anyone looked at it recently? Would it be worth killing it off in favor of the ThinLTO pipeline just so we don't have to maintain it?

The reason they don't use the same pipeline is that it would be too expensive for regular LTO mode which is serial. ThinLTO can use a more aggressive post-link pipeline due to the parallelism. Mehdi did some measurements a couple years ago and found that using the ThinLTO pipeline for regular LTO increases the compile time in that mode significantly.

Regarding this patch, the change seems reasonable but I suppose any new addition to the regular LTO pipeline needs to consider the compile time vs performance tradeoff. If this patch is not too expensive then IMO it is fine to add.

What's the general state of the LTO pipeline at this point? PassManagerBuilder::addLTOOptimizationPasses is adding passes in a really weird order (in particular, the placement of the inliner is really strange). Has anyone looked at it recently? Would it be worth killing it off in favor of the ThinLTO pipeline just so we don't have to maintain it?

The reason they don't use the same pipeline is that it would be too expensive for regular LTO mode which is serial. ThinLTO can use a more aggressive post-link pipeline due to the parallelism. Mehdi did some measurements a couple years ago and found that using the ThinLTO pipeline for regular LTO increases the compile time in that mode significantly.

It doubled the link time of clang if I remember correctly.

Regarding this patch, the change seems reasonable but I suppose any new addition to the regular LTO pipeline needs to consider the compile time vs performance tradeoff. If this patch is not too expensive then IMO it is fine to add.

Sorry for the delay, it has taken some time to get compile times. I have built two very large internal codebases 10 times each with and without the pass in the LTO pipeline. The mean compile time difference (with - without as percentage of without) was:

Codebase1: -0.03%
Codebase2: -0.56%

The results are negative, because the compile time was very slightly faster with the pass. However, the values are too small to be significant. The standard deviation of the compile times was (given as a percentage of the mean):

Codebase1: without: 0.40% with: 0.41%
Codebase2: without: 0.33% with: 0.49%

As the results were not what I was expecting, I checked everything and repeated the builds. This confirmed the first results (mean compile time very slightly faster with the pass).

tejohnson accepted this revision.Mar 19 2019, 11:03 AM

Sorry missed your earlier update. LGTM. Thanks for doing the measurements!

This revision is now accepted and ready to land.Mar 19 2019, 11:03 AM
This revision was automatically updated to reflect the committed changes.

Sorry missed your earlier update. LGTM. Thanks for doing the measurements!

Thanks for the review! I've had to revert it as it was causing some LLD failures. I'll update the tests and resubmit tomorrow.