Phabricator

[LICM] Make promotion faster
Needs Review · Public

Authored by nikic on Mon, Oct 12, 1:05 PM.

Details

Reviewers
asbirlea
fhahn
Summary

As mentioned at the MSSA roundtable, LICM spends a lot of time constructing an AST, despite the caps already in place. This patch is a proposal for reducing the impact.

The idea here is pretty simple: We're only interested in must-alias mod sets of loop-invariant pointers. As such, only populate the AST with loop-invariant loads and stores (anything else is definitely not promotable) and then discard any sets that alias any of the remaining, definitely non-promotable accesses.

If we promoted something, check whether this has made some other accesses loop-invariant and thus possible promotion candidates.
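Schematically, the filtering step can be modeled as below. This is a self-contained sketch, not the actual LLVM code: the `MemAccess` struct, the `aliases` stand-in, and `promotionCandidates` are all invented names for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Schematic model of a memory access in the loop.
struct MemAccess {
  std::string Ptr;        // the pointer operand
  bool LoopInvariantPtr;  // is the pointer loop-invariant?
  bool LoadOrStore;       // plain load/store (anything else is unpromotable)
};

// Stand-in for a real alias query: here, two accesses alias iff they
// use the same pointer name.
static bool aliases(const MemAccess &A, const MemAccess &B) {
  return A.Ptr == B.Ptr;
}

// Return the pointers that remain promotion candidates: loop-invariant
// loads/stores whose set is not touched by any definitely non-promotable
// access (variant pointer, or not a plain load/store).
std::vector<std::string>
promotionCandidates(const std::vector<MemAccess> &Accesses) {
  std::vector<std::string> Candidates;
  for (const MemAccess &A : Accesses) {
    if (!A.LoopInvariantPtr || !A.LoadOrStore)
      continue;  // never enters the AST in the first place
    bool Clobbered = false;
    for (const MemAccess &B : Accesses) {
      if (&A == &B)
        continue;
      // Any aliasing access that would not itself be in the AST makes
      // the whole set non-promotable, so it is discarded.
      if ((!B.LoopInvariantPtr || !B.LoadOrStore) && aliases(A, B))
        Clobbered = true;
    }
    if (!Clobbered)
      Candidates.push_back(A.Ptr);
  }
  return Candidates;
}
```

The point of the restructuring is that the quadratic alias-query work only runs over the (usually small) loop-invariant load/store subset, rather than every access in the loop.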

This has a large positive compile-time impact: https://llvm-compile-time-tracker.com/compare.php?from=d7186fe3710828fab03de69f78f01f001d70e1aa&to=264016756045aba11ae326bfab6a632bc5ef1855&stat=instructions We save ~2% geomean at O3, and lencod in particular saves 6%.

There is no impact on the number of promotions (licm.NumPromoted) in test-suite with this change.

Diff Detail

Event Timeline

nikic created this revision. Mon, Oct 12, 1:05 PM
nikic requested review of this revision. Mon, Oct 12, 1:05 PM

An ignorant word of caution: anything can be made faster by making it do less stuff.

> There's no code-size impact on any of the CTMark programs, so code quality impact of the change should be low.

That doesn't really sound like exhaustive-enough test coverage to me.

nikic edited the summary of this revision. Mon, Oct 12, 2:24 PM
nikic added a comment. Mon, Oct 12, 2:30 PM

> An ignorant word of caution: anything can be made faster by making it do less stuff.
>
>> There's no code-size impact on any of the CTMark programs, so code quality impact of the change should be low.
>
> That doesn't really sound like exhaustive-enough test coverage to me.

Yes, of course :) I have checked this on the full test-suite now, and there are no codegen differences with this patch. The only differences in statistics are...

| Statistic | Before | After |
| basicaa.SearchLimitReached | 146199 | 128184 |
| basicaa.SearchTimes | 50855724 | 23825700 |
| memory-builtins.ObjectVisitorArgument | 331614 | 254610 |
| memory-builtins.ObjectVisitorLoad | 149739 | 81225 |

...which just tells us that we're doing a lot fewer AA queries, which is exactly the intention behind this. Of course, the lit test failures indicate that regressions here are possible, even if they don't show up in practice.

>> An ignorant word of caution: anything can be made faster by making it do less stuff.
>>
>>> There's no code-size impact on any of the CTMark programs, so code quality impact of the change should be low.
>>
>> That doesn't really sound like exhaustive-enough test coverage to me.
>
> Yes, of course :) I have checked this on the full test-suite now, and there are no codegen differences with this patch. The only differences in statistics are...

I guess I was thinking of even bigger coverage (SPEC?), but I personally don't have access to it either.

> | Statistic | Before | After |
> | basicaa.SearchLimitReached | 146199 | 128184 |
> | basicaa.SearchTimes | 50855724 | 23825700 |
> | memory-builtins.ObjectVisitorArgument | 331614 | 254610 |
> | memory-builtins.ObjectVisitorLoad | 149739 | 81225 |
>
> ...which just tells us that we're doing a lot fewer AA queries, which is exactly the intention behind this. Of course, the lit test failures indicate that regressions here are possible, even if they don't show up in practice.

IIRC those run-to-run stats changes are always present, and are therefore indicative of nondeterminism.

Could you address the loop versioning test failures?

nikic updated this revision to Diff 297900. Tue, Oct 13, 10:04 AM

Rebase to pick up fix for the LoopVersioningLICM noalias metadata. Those regressions are no longer present now.

nikic updated this revision to Diff 297937. Tue, Oct 13, 1:13 PM
nikic edited the summary of this revision.

Rebase over NFC to make this patch smaller.

I tested this in two configurations we use for compiler releases, and I'm seeing quite a few runtime regressions in the 10-20% range and a couple of 40% runtime regressions.
The compile-time gain doesn't look worth such large runtime regressions.

nikic updated this revision to Diff 298196. Wed, Oct 14, 11:15 AM
nikic edited the summary of this revision.

Add back support for "promotion of promotion".

> I tested this in two configurations we use for compiler releases, and I'm seeing quite a few runtime regressions in the 10-20% range and a couple of 40% runtime regressions.
> The compile-time gain doesn't look worth such large runtime regressions.

Thank you for running those tests! Those results are pretty unexpected to me, as this was not supposed to change the behavior in any substantial way. The two possibilities that come to mind are:

  • Those regressions were using the "promotion of promotion" behavior. I originally did not bother supporting this because the situation seemed so contrived, and I would expect EarlyCSE/GVN or anything else doing store-load forwarding to handle it. However, it's also really simple to support and doesn't come with additional compile-time cost, so I extended the patch to handle it (thus also removing the one lit test regression).
  • The different processing order for accesses ends up mattering in those cases. Due to AATags intersection, the AST is order-dependent, and the different order can result in more or fewer accesses being promoted (and either direction could be either good or bad for performance).
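For illustration, the first case might look like the following contrived, hypothetical example (not taken from any regressing benchmark; the function name is invented): once the load of *pp is promoted to a register, the inner pointer q becomes loop-invariant, and the accesses through q become promotion candidates in a second round.

```c
/* Hypothetical "promotion of promotion" case: nothing in the loop
 * writes *pp, so the load of *pp can be promoted. After that
 * promotion, q is loop-invariant, so *q can be promoted as well. */
static int sum_via_indirect(int **pp, int n) {
  int total = 0;
  for (int i = 0; i < n; ++i) {
    int *q = *pp;  /* promotable load of *pp */
    *q += i;       /* promotable only once q is known loop-invariant */
    total += *q;
  }
  return total;
}
```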

It's hard to guess without a test case to look at, and unfortunately test-suite doesn't have one. Are any of the regressions you observed in publicly available code that I could check?

I don't really want to give up on this change yet, as the compile-time impact is fairly substantial for some types of code (a few >20% wins on individual files).

A couple of public benchmarks:
SingleSource polybench jacobi-1d-imper is seeing an 11.9% runtime regression.
MultiSource MallocBench_gs is seeing a 15.8% runtime regression.
Both on Haswell, without ThinLTO.
With ThinLTO, Eigen is seeing a lot of regressions in the 5-20% range, and another benchmark has the 40% spike.

I reran the performance numbers after the promotion-of-promotion patch update, and the results are improved.
In opt, on Haswell, there are still a few notable regressions in the 5-10% range, with one 32% outlier; SingleSource looks resolved; MultiSource MallocBench_gs is still seeing a 16.3% regression.
With ThinLTO there are still lots of 5-17% regressions in Eigen, but the 40% spike is replaced with a range from a 5% regression to one significant improvement.

I'll run additional configurations to determine if the gains outweigh the regressions.
On the two configurations I tested so far, there are enough regressions that I wouldn't push this. I wouldn't drop it just yet either; it has good potential for both compile time and run time.