This is an archive of the discontinued LLVM Phabricator instance.

[LICM] Make promotion faster
ClosedPublic

Authored by nikic on Oct 12 2020, 1:05 PM.

Details

Summary

As mentioned at the MSSA roundtable, LICM spends a lot of time constructing an AST, despite the caps already in place. This patch proposes a way to reduce that impact.

The idea here is pretty simple: We're only interested in must-alias mod sets of loop-invariant pointers. As such, only populate the AST with loop-invariant loads and stores (anything else is definitely not promotable) and then discard any sets which alias with any of the remaining, definitely non-promotable accesses.

If we promoted something, check whether this has made some other accesses loop invariant and thus possible promotion candidates.
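The filtering idea can be sketched roughly as follows. This is a toy model, not the actual LLVM API: the struct, function names, and the string-equality alias oracle are all illustrative stand-ins for `Value*` pointers and real alias analysis.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model: only accesses through loop-invariant pointers are promotion
// candidates; any candidate that may alias one of the skipped (variant)
// accesses is discarded, because its set cannot be a must-alias mod set.
struct Access {
  std::string Ptr;        // stands in for the pointer operand (a Value*)
  bool LoopInvariantPtr;  // is the pointer loop-invariant?
};

// Toy alias oracle: two accesses alias iff they use the same pointer name.
static bool alias(const Access &A, const Access &B) { return A.Ptr == B.Ptr; }

// Returns the pointers that survive as promotion candidates.
std::vector<std::string> promotablePointers(const std::vector<Access> &Accs) {
  std::vector<Access> Candidates, Skipped;
  for (const Access &A : Accs)
    (A.LoopInvariantPtr ? Candidates : Skipped).push_back(A);

  std::vector<std::string> Result;
  for (const Access &C : Candidates) {
    bool Clobbered = false;
    for (const Access &S : Skipped)
      if (alias(C, S)) { Clobbered = true; break; }
    if (!Clobbered)
      Result.push_back(C.Ptr);
  }
  return Result;
}
```

The point is that the (expensive) alias queries are only issued for accesses that could possibly be promoted, instead of populating the tracker with everything in the loop.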

This has a large positive compile-time impact (https://llvm-compile-time-tracker.com/compare.php?from=f197cf2126be3b224cadfe8b1cde9c05f638a0ea&to=7f1e24e26d198e1d80d32c87c9fd1f5cf3cc8c5f&stat=instructions): we save ~1.8% geomean at O3, and lencod in particular saves 6%, with up to 25% on individual files.

There is no impact on the number of promotions (licm.NumPromoted) in test-suite in the O3 configuration with this change.

Diff Detail

Event Timeline

nikic created this revision.Oct 12 2020, 1:05 PM
nikic requested review of this revision.Oct 12 2020, 1:05 PM

An ignorant word of caution: anything can be made faster by making it do less stuff.

There's no code-size impact on any of the CTMark programs, so code quality impact of the change should be low.

That doesn't really sound like exhaustive-enough test coverage to me.

nikic edited the summary of this revision. (Show Details)Oct 12 2020, 2:24 PM
nikic added a comment.Oct 12 2020, 2:30 PM

An ignorant word of caution: anything can be made faster by making it do less stuff.

There's no code-size impact on any of the CTMark programs, so code quality impact of the change should be low.

That doesn't really sound like exhaustive-enough test coverage to me.

Yes, of course :) I have checked this on the full test-suite now, and there are no codegen differences with this patch. The only differences in statistics are...

statistic | before | after
basicaa.SearchLimitReached | 146199 | 128184
basicaa.SearchTimes | 50855724 | 23825700
memory-builtins.ObjectVisitorArgument | 331614 | 254610
memory-builtins.ObjectVisitorLoad | 149739 | 81225

...which just tells us that we're issuing far fewer AA queries, which is exactly the intention behind this change. Of course, the lit test failures indicate that regressions here are possible, even if they don't show up in practice.

An ignorant word of caution: anything can be made faster by making it do less stuff.

There's no code-size impact on any of the CTMark programs, so code quality impact of the change should be low.

That doesn't really sound like exhaustive-enough test coverage to me.

Yes, of course :) I have checked this on the full test-suite now, and there are no codegen differences with this patch. The only differences in statistics are...

I guess I was thinking of even bigger coverage (SPEC?, but I personally don't have it (either?)).

statistic | before | after
basicaa.SearchLimitReached | 146199 | 128184
basicaa.SearchTimes | 50855724 | 23825700
memory-builtins.ObjectVisitorArgument | 331614 | 254610
memory-builtins.ObjectVisitorLoad | 149739 | 81225

...which just tells us that we're issuing far fewer AA queries, which is exactly the intention behind this change. Of course, the lit test failures indicate that regressions here are possible, even if they don't show up in practice.

IIRC those run-to-run stats changes are there always, and are therefore indicative of nondeterminism.

Could you address the loop versioning test failures?

nikic updated this revision to Diff 297900.Oct 13 2020, 10:04 AM

Rebase to pick up fix for the LoopVersioningLICM noalias metadata. Those regressions are no longer present now.

nikic updated this revision to Diff 297937.Oct 13 2020, 1:13 PM
nikic edited the summary of this revision. (Show Details)

Rebase over NFC to make this patch smaller.

I tested this in two configurations we use for compiler releases, and I'm seeing quite a few runtime regressions in the 10-20% range, and a couple of 40% runtime regressions.
It doesn't look worth the compile-time gain with such large runtime regressions.

nikic updated this revision to Diff 298196.Oct 14 2020, 11:15 AM
nikic edited the summary of this revision. (Show Details)

Add back support for "promotion of promotion".

I tested this in two configurations we use for compiler releases, and I'm seeing quite a few runtime regressions in the 10-20% range, and a couple of 40% runtime regressions.
It doesn't look worth the compile-time gain with such large runtime regressions.

Thank you for running those tests! Those results are pretty unexpected to me, as this was not supposed to change the behavior in any substantial way. The two possibilities that come to mind are:

  • Those regressions were using the "promotion of promotion" behavior. I originally did not bother supporting this because the situation seemed so contrived, and I would expect EarlyCSE/GVN or anything else doing store-load forwarding to handle it. However, it's also really simple to support and doesn't come with additional compile-time cost, so I extended the patch to handle it (thus also removing the one lit test regression).
  • The different processing order for accesses ends up mattering in those cases. Due to AATags intersection the AST is order-dependent, and the different order can result in more or fewer accesses being promoted (and either direction could be either good or bad for performance).

It's hard to guess without a test case to look at, and unfortunately test-suite doesn't have one. Are any of the regressions you observed in publicly available code that I could check?

I don't really want to give up on this change yet, as the compile-time impact is fairly substantial for some types of code (a few >20% wins on individual files).

A couple of public benchmarks:
singlesource: polybench jacobi 1d imper is seeing an 11.9% runtime regression
multisource: MallocBench_gs is seeing a 15.8% runtime regression
Both on Haswell, without ThinLTO.
With ThinLTO, eigen is seeing a lot of regressions in the 5-20% range, and another benchmark has the 40% spike.

I reran the performance numbers after the promotion-of-promotion patch update, and the results have improved.
In opt, on Haswell, there are still a few notable regressions in the range of 5-10%, with one 32% outlier; singlesource looks resolved; multisource MallocBench_gs is still seeing a 16.3% regression.
With ThinLTO there are still lots of 5-17% regressions in eigen, but the 40% spike is replaced with a range from 5% regressions to one significant improvement.

I'll run additional configurations to determine if the gains outweigh the regressions.
On the two configurations I tested so far, there are enough regressions that I wouldn't push this. I wouldn't drop it just yet either; it has good potential for compile and run time.

fhahn requested changes to this revision.Nov 2 2020, 1:50 PM

Marking as changes required, as there are some reported regressions with this change.

This revision now requires changes to proceed.Nov 2 2020, 1:50 PM
nikic updated this revision to Diff 324096.Feb 16 2021, 1:28 PM
nikic edited the summary of this revision. (Show Details)

Rebase with some minor cleanup.

nikic added a comment.Feb 16 2021, 1:51 PM

@asbirlea Would you mind giving this patch another try? There have been quite a few changes on the alias analysis side in the meantime. I just ran into another case where LICM promotion takes more than 10% of compile-time and remembered this patch.

Ack, testing in progress.

Runtime impact looks reasonable now. What's the compile-time impact for the patch now?

Minor comments inline.

llvm/lib/Transforms/Scalar/LICM.cpp
461

Nit: It seems the conditions in IsPotentiallyPromotable would be useful as an early filter, to prevent filling up MaybePromotable only to clean it up shortly after.

2245

Remove the addAllInstructionsInLoopUsingMSSA API as well.

2273

A quick exit check for promotion is whether the AST is saturated.
We'd probably need to add an API for that inside the AliasSetTracker:

bool isSaturated() {
  return AliasAnyAS != nullptr;
}

if (AST.isSaturated())
  return {}; // Nothing to promote...
nikic updated this revision to Diff 324918.Feb 19 2021, 1:40 AM
nikic marked an inline comment as done.

Remove addAllInstructionsInLoopUsingMSSA() API.

nikic added a comment.Feb 19 2021, 1:51 AM

Runtime impact looks reasonable now. What's the compile-time impact for the patch now?

The link in the summary is already an updated run (https://llvm-compile-time-tracker.com/compare.php?from=f197cf2126be3b224cadfe8b1cde9c05f638a0ea&to=7f1e24e26d198e1d80d32c87c9fd1f5cf3cc8c5f&stat=instructions), so this is now ~1.8% geomean improvement at O3, with 5.5% on lencod in particular.

llvm/lib/Transforms/Scalar/LICM.cpp
461

The idea here is that some accesses may not be promotable initially, but may become promotable after others have been promoted. Each iteration removes all potentially promotable accesses from MaybePromotable, leaving the rest for the next iteration (if any promotion happened). For this reason we can't do early filtering here. (See also the comment on the loop below.)
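That fixed-point iteration can be sketched as below. This is a simplified, hypothetical model rather than the actual LICM code: `Candidate`, `DependsOn`, and `promoteIteratively` are illustrative names, and a candidate "becomes loop-invariant" here simply when the value its pointer depends on has been promoted.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy model of "promotion of promotion": after a round of promotions,
// accesses whose pointers depended on a now-promoted value may have become
// loop-invariant, so the remaining worklist is retried until a round
// promotes nothing.
struct Candidate {
  std::string Ptr;
  std::string DependsOn; // "" if the pointer is already loop-invariant
};

std::set<std::string> promoteIteratively(std::vector<Candidate> Work) {
  std::set<std::string> Promoted;
  bool Changed = true;
  while (Changed) {
    Changed = false;
    std::vector<Candidate> Next;
    for (const Candidate &C : Work) {
      bool Invariant = C.DependsOn.empty() || Promoted.count(C.DependsOn);
      if (Invariant) {
        Promoted.insert(C.Ptr);
        Changed = true;
      } else {
        Next.push_back(C); // leave for a later round
      }
    }
    Work = std::move(Next);
  }
  return Promoted;
}
```

In this sketch, a candidate left in the worklist only gets another chance if the previous round promoted something, which mirrors the "if any promotion happened" condition described above.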

2273

If I'm understanding how this works correctly, won't the "saturated" case be handled automatically? If the AST is saturated, then there will only be a single may-alias set, so the loop below will only iterate once and return afterwards.

asbirlea accepted this revision.Mar 2 2021, 12:58 PM
asbirlea added inline comments.
llvm/lib/Transforms/Scalar/LICM.cpp
461

I misread the condition on which accesses are kept in MaybePromotable. This is fine.

2273

Yes, you're right, this is good.

This revision was not accepted when it landed; it landed in state Needs Review.Mar 2 2021, 1:21 PM
This revision was automatically updated to reflect the committed changes.
nikic reopened this revision.Mar 9 2021, 3:13 AM
nikic updated this revision to Diff 329276.Mar 9 2021, 3:16 AM

Always use aliasesUnknownInst() API to correctly model atomic accesses. Add previously miscompiled test case.

asbirlea accepted this revision.Mar 10 2021, 5:20 PM

LGTM.

This revision was not accepted when it landed; it landed in state Needs Review.Mar 11 2021, 1:50 AM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.
fhahn added a comment.May 17 2021, 3:22 AM

It looks like this is causing Clang to crash: https://bugs.llvm.org/show_bug.cgi?id=50367