Page MenuHomePhabricator

[Passes] add a tail-call-elim pass near the end of the opt pipeline
ClosedPublic

Authored by spatel on Jul 22 2022, 11:21 AM.

Details

Summary

We call tail-call-elim near the beginning of the pipeline, but that is too early to annotate calls that get added later.

In the motivating case from issue #47852, the missing 'tail' on memset leads to sub-optimal codegen.

I experimented with removing the early instance of tail-call-elim instead of just adding another pass, but that appears to be slightly worse for compile-time: +0.15% vs. +0.08% time.
"tailcall" shows adding the pass; "tailcall2" shows moving the pass to later, then adding the original early pass back (so 1596886802 is functionally equivalent to 180b0439dc ):
https://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=rotateright

Diff Detail

Event Timeline

spatel created this revision.Jul 22 2022, 11:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 22 2022, 11:21 AM
spatel requested review of this revision.Jul 22 2022, 11:21 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 22 2022, 11:21 AM
nikic added a comment.Jul 22 2022, 1:03 PM

I experimented with removing the early instance of tail-call-elim instead of just adding another pass, but that appears to be slightly worse for compile-time: +0.15% vs. +0.08% time.
"tailcall" shows adding the pass; "tailcall2" shows moving the pass to later, then adding the original early pass back (so 1596886802 is functionally equivalent to 180b0439dc ):
https://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=rotateright

This is probably because tail calls short-circuit some AA modref queries, so we probably end up saving time there.

I'm fine with adding an extra TailCallElim pass -- I guess my main uncertainty here is where exactly it should be positioned. Might it make more sense to move it to the optimization rather than simplification pipeline? As our interest here is in tail markers rather than actual TCE, this seems like a typical late optimization pass. (On that note, we could also run only the markTails portion of TCE, but given the overall small compile-time impact, doing the split is probably not worth the effort?)

I experimented with removing the early instance of tail-call-elim instead of just adding another pass, but that appears to be slightly worse for compile-time: +0.15% vs. +0.08% time.
"tailcall" shows adding the pass; "tailcall2" shows moving the pass to later, then adding the original early pass back (so 1596886802 is functionally equivalent to 180b0439dc ):
https://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=rotateright

This is probably because tail calls short-circuit some AA modref queries, so we probably end up saving time there.

I'm fine with adding an extra TailCallElim pass -- I guess my main uncertainty here is where exactly it should be positioned. Might it make more sense to move it to the optimization rather than simplification pipeline? As our interest here is in tail markers rather than actual TCE, this seems like a typical late optimization pass. (On that note, we could also run only the markTails portion of TCE, but given the overall small compile-time impact, doing the split is probably not worth the effort?)

Yes, this should be run later. I saw MemCpyOpt and figured that was the last place we'd create memset/memcpy calls like in the bug reports, but there's no reason to run TailCallElim that early.

spatel updated this revision to Diff 447156.Jul 24 2022, 5:38 PM

Patch updated:
Changed to invoke TailCallElim much later. I put it just before the last SimplifyCFG and some late passes that I'm not familiar with - just in case they could benefit.
This mostly picks up vectorizer-related intrinsics in test diffs and doesn't seem to have any downside. The time cost even looks slightly better (not sure where the noise level is in these results, but we're probably not far off):
https://llvm-compile-time-tracker.com/index.php?config=NewPM-O3&stat=instructions&remote=rotateright

spatel retitled this revision from [Passes] add a tail-call-elim pass near the end of the function simplification pipeline to [Passes] add a tail-call-elim pass near the end of the opt pipeline.Jul 24 2022, 5:40 PM
nikic accepted this revision.Jul 25 2022, 3:31 AM

LGTM

This revision is now accepted and ready to land.Jul 25 2022, 3:31 AM

JFYI @spatel As I see this is quite similar to issue 53482. I remember some efforts to move out marking tail calls to a separate pass so this should be cheaper to do marking once again (not sure to what extent). It was not landed earlier, but it was ready to land as I understand :)

JFYI @spatel As I see this is quite similar to issue 53482. I remember some efforts to move out marking tail calls to a separate pass so this should be cheaper to do marking once again (not sure to what extent). It was not landed earlier, but it was ready to land as I understand :)

Looks like not worth it? Seems that a second instance of this pass is really cheap to run.

Just JFYI :)
Yes, probably not worth it

JFYI @spatel As I see this is quite similar to issue 53482. I remember some efforts to move out marking tail calls to a separate pass so this should be cheaper to do marking once again (not sure to what extent). It was not landed earlier, but it was ready to land as I understand :)

Thanks for the pointer! I didn't know about that patch. I'll add a note about that to the commit message, so we can revive it if we decide it is worth saving some compile-time for a little extra code/specialization.

xbolva00 accepted this revision.Jul 25 2022, 8:06 AM

LG as well

LG as well

Thanks.
I wasn't running 'check-clang' with this applied, and now I see that hundreds of those test files are affected (because they are wrongly running the entire opt), so I have to see what it takes to update those.

LG as well

Thanks.
I wasn't running 'check-clang' with this applied, and now I see that hundreds of those test files are affected (because they are wrongly running the entire opt), so I have to see what it takes to update those.

The patch with clang test diffs took forever to update (even with most tests using scripted check lines), and it is too big to post on Phab. The vast majority of changes are with AArch64's sve tests. It might have been better to just rule out target-specific intrinsics...

AArch64's sve tests

Why they use -OX at all?

AArch64's sve tests

Why they use -OX at all?

The original argument that I remember from a few years ago was that unoptimized IR or even -mem2reg only was too large + noisy to be feasible on some files, but it seems to have exploded from there. I think every file in these dirs needed updating:
https://github.com/llvm/llvm-project/tree/main/clang/test/CodeGen/aarch64-sve-intrinsics
https://github.com/llvm/llvm-project/tree/main/clang/test/CodeGen/aarch64-sve2-intrinsics

This revision was landed with ongoing or failed builds.Jul 25 2022, 12:26 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptJul 25 2022, 12:26 PM

Thanks a lot for fixing this!

Just JFYI :)
Yes, probably not worth it

that is interesting. do we know why?

Just JFYI :)
Yes, probably not worth it

that is interesting. do we know why?

Based on this data:
https://llvm-compile-time-tracker.com/compare.php?from=95f4ca7f5db623bacc2e34548d39fe5b28d47bad&to=bfb9b8e075ee32197157ccaf0c301122ca9b81af&stat=instructions

This patch (adding a late round of TCE) caused a ~0.06% compile-time regression for a -O3 build (less for the LTO variants). So the value of splitting the pass as proposed in D60031 depends on whether we think it's worth trying to save some (unknown?) fraction of the 0.06%.
We decided to push this for the known codegen wins, but someone can still revive the pass-splitting patch if it seems worthwhile.

One more note that I failed to mention while updating all of those clang tests: the reason those tests did not show "tail" before is because we only ran TCE with -O{2/3/s/z}, not -O1. This patch enabled TCE for all -O levels. I don't know the history/motivation for not including TCE at -O1 before, but it did not seem worth excluding based on compile-time cost. If there's another reason, we can add that limitation to the late invocation too (and it should cause most/all of the clang test diffs to revert).

Thanks for clarifying!