This is an archive of the discontinued LLVM Phabricator instance.

LSR: an alternative way to resolve complex solution
ClosedPublic

Authored by evstupac on Feb 10 2017, 7:49 PM.

Details

Summary

The patch introduces an alternative method of resolving complex LSR solutions.
The method is based on choosing the solution with the lowest mathematical expectation of the number of registers.
It should be generally faster. In my benchmarks, x86 32-bit performance is better:
164.gzip +2%
viterbi algorithm benchmark ~10%

I'd like to commit this under an option for testing (now true by default for testing).
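The idea of scoring candidate formulas by their expected register count can be sketched roughly as follows. This is an illustrative model only, not the patch's actual code: the names, data structures, and the weighting rule (a register shared by candidate formulas of several uses contributes the reciprocal of the number of uses referencing it) are my assumptions drawn from the example discussed later in this thread.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical illustration of the heuristic (names and structure are mine,
// not the patch's): each formula is scored by the expected number of
// registers it adds to the solution.  A register that candidate formulas of
// several uses could share contributes only fractionally, weighted by the
// reciprocal of the number of uses referencing it.
using Formula = std::vector<std::string>; // registers referenced by a formula

double expectedRegCount(const Formula &F,
                        const std::map<std::string, unsigned> &UsesReferencing) {
  double E = 0.0;
  for (const std::string &Reg : F)
    E += 1.0 / UsesReferencing.at(Reg); // shared registers count fractionally
  return E;
}
```

Under this model, a formula {a} + {0,+,1} where {0,+,1} is shared between two uses scores 1 + 1/2, while the single-register form {a,+,1} scores 1, matching the worked example below in the thread.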

Diff Detail

Repository
rL LLVM

Event Timeline

evstupac created this revision.Feb 10 2017, 7:49 PM
qcolombet edited edge metadata.Feb 16 2017, 5:55 PM

Hi,

That looks interesting, thanks for working on this.

Regarding the performance numbers you mentioned, I'm a bit confused, are those compile time numbers?

Cheers,
-Quentin

lib/Transforms/Scalar/LoopStrengthReduce.cpp
4337

I am not a fan of using floats in heuristics, as they can introduce subtle differences from one target to another.
Could we keep the numerator and denominator as two separate variables and do the comparison accordingly?

We may already have a helper class for that (what is used in BlockFrequency, maybe?).

Hi Quentin,

Regarding the performance numbers you mentioned, I'm a bit confused, are those compile time numbers?

These are performance numbers. I didn't run compile-time tests. The case where we exceed 65536 variants is not that frequent, so I don't expect much difference.
However, we could lower the limit (if performance is unchanged), which would give some compile-time advantage.

The initial motivating example comes from the zlib deflate algorithm, which becomes ~10-30% faster (depending on the x86 32-bit CPU).

Thanks,
Evgeny

lib/Transforms/Scalar/LoopStrengthReduce.cpp
4337

I am not a fan of using floats in heuristics, as they can introduce subtle differences from one target to another.
Could we keep the numerator and denominator as two separate variables and do the comparison accordingly?

Yes, that is doable; however, it will most likely be slower.
I'll take a look at what we can do here (maybe using BlockFrequency).

Hi,

Thanks for the clarification.

It should be generally faster.

I was confused by this statement thinking you were trying to improve compile time :).

LGTM with two caveats:

  1. Should we set the option to true by default? I believe it would be best to keep it to false and send an email to llvm-dev to ask for benchmarking.
  2. The floating point thingy that I mentioned.

I am fine with moving forward with the current patch as long as you commit to look at #2.

Cheers,
Q.

lib/Transforms/Scalar/LoopStrengthReduce.cpp
4337

BlockFrequency is probably not the best abstraction, but maybe it uses something we can reuse here.

qcolombet accepted this revision.Feb 17 2017, 2:44 PM
This revision is now accepted and ready to land.Feb 17 2017, 2:44 PM
  1. Should we set the option to true by default? I believe it would be best to keep it to false and send an email to llvm-dev to ask for benchmarking.
  2. The floating point thingy that I mentioned.

I am fine with moving forward with the current patch as long as you commit to look at #2.

Thanks, Quentin.
I'm OOO next week. Committing the patch as is is the best option for me to get some feedback during my OOO.
I'll commit it with the option set to "false" by default.
It was set to true in the review for two reasons: testing (if someone downloads the patch as is) and to show the influence on the LIT tests.
And I will not set the option to true until a decision on #2 is made. I'll look into it once I'm back.

Thanks,
Evgeny

This revision was automatically updated to reflect the committed changes.

Hello,

Looks like you committed with the option set to true. This seems to introduce some regressions we are seeing in internal benchmarks. While there are some good improvements too, some benchmarks are regressing 15-20%. For example, I believe these Shootout-C++ matrix regressions are due to this commit:
http://llvm.org/perf/db_default/v4/nts/daily_report/2017/2/21?day_start=16

Unless you have any objections, I think we should set the option default to false for the time being, like the review says.

Thanks,
Dave

For example I believe these Shootout-C++ matrix regressions are due to this commit

Yes. However, the regression was triggered by r295538, which I missed in my testing.
Anyway, this is a regression and I need to address it somehow.
And yes, I have no objections.

Thanks,
Evgeny

bmakam added a subscriber: bmakam.Feb 28 2017, 9:51 AM

FWIW, we are seeing a 9% regression in spec2006/hmmer on our AArch64 Kryo target with this flag turned on by default.

Thanks for reporting the regressions. It looks like I have a simple fix for both.
Here is the case:
Use1 (Address):
  {a} + {0,+,1}    (register-count expectation: 1 + 1/2)
  {a,+,1}          (register-count expectation: 1)
Use2 (Address):
  {b} + {0,+,1}
  {b,+,1}
Use3 (ICmpZero):
  -1024 + {0,+,1}

That way the new method will select {a,+,1}, {b,+,1} and -1024 + {0,+,1}, which is not optimal in terms of AddRecExprs (though still optimal in terms of register count).

In the Matrix test there are 32 Address uses like the above. The expectation of {a} + {0,+,1} becomes very close to 1, but still greater than 1.
The solution is simple: delete formulas in ICmpZero uses before Address uses. That way the optimal solution is selected (as {0,+,1} becomes unique).
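The ordering fix described here can be sketched as below. This is a hedged illustration only (the enum and function are hypothetical; the real logic lives in LoopStrengthReduce.cpp's search-space narrowing): by pruning formulas for ICmpZero uses before Address uses, a register like {0,+,1} that survives only in the ICmpZero use becomes unique early, freeing the Address uses to pick their single-register forms.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch of the fix described above: order the use worklist so
// that ICmpZero uses have their formulas pruned first.
enum class UseKind { ICmpZero, Address, Basic };

void orderForPruning(std::vector<UseKind> &Uses) {
  // Stable sort: ICmpZero uses move to the front; all other uses keep
  // their original relative order.
  std::stable_sort(Uses.begin(), Uses.end(), [](UseKind A, UseKind B) {
    return A == UseKind::ICmpZero && B != UseKind::ICmpZero;
  });
}
```

The stable sort matters in this sketch: it reorders only across the ICmpZero/non-ICmpZero boundary, so pruning behavior for the remaining uses is otherwise unchanged.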

And I'm testing a new patch without floats (with an APInt numerator and denominator).

I have a fix to address both regressions and am going to publish it after review: https://reviews.llvm.org/D30527.
Before that, should I revert the option to false?

Thanks,
Evgeny

The following patch should fix both matrix and hmmer regressions:
https://reviews.llvm.org/D30552

Thanks,
Evgeny

Thanks for looking into the regression. I tested D30552 on our AArch64 Kryo target, and for spec2006/hmmer it recovered some of the lost performance; however, it is still 2% regressed, compared to the 9% regression previously with the lsr-exp-narrow flag on by default.

however it is still 2% regressed

I don't have a system at hand; I've just looked into the changes.
Are there any gains in other benchmarks that can cover this 2% regression?
I mean, is it OK to leave the regression, or should we look deeper?

LGTM with two caveats:

  1. Should we set the option to true by default? I believe it would be best to keep it to false and send an email to llvm-dev to ask for benchmarking.

I strongly agree with Quentin that this should have been committed with the feature disabled by default. I appreciate that the regressions are being addressed, but that should happen prior to the feature being enabled. I say this because this is a rather pervasive change that hasn't been fully vetted on targets other than x86.

Here are some results from our LTO config on Kryo:
• Spec2006/h264ref -2.8%
• Spec2006/hmmer -8.9%
• Spec2006/perlbench +3.5%
• Spec2006/sjeng +2.3%
• Spec2006/dealII +2.3%
• Spec2006/sphinx3 -3.2%
• Spec2000/crafty +2%
• Spec2000/eon -2.4%
• Spec2000/mcf -2.3%
• Spec2000/perlbmk -2.8%
• Spec2000/mesa -3.2%

As you can see, this is a net loss. Admittedly, this is a comparison against the prior week's performance and includes many changes, but we have at least bisected the spec2006/hmmer regression to this commit (yes, I understand you have a fix pending, but that's beside the point). I've asked Balaram @bmakam to perform more exhaustive testing on our platform. In the meantime, can you please disable this feature (as was originally suggested) until we've done some amount of due diligence?

Thanks for the data.
Sure, I'll set the option to false today.

Thanks,
Evgeny

Thanks for the data.
Sure, I'll set the option to false today.

Thank you!

Regarding the other regressions in spec2006:
The new method does not guarantee a perfect solution, so I think it would be fair to apply it if it generally produces better code.
By generally, I mean that the sum of all LSR solution registers (say, across a benchmark) becomes lower.
I can collect such statistics for your arch (please provide me with the exact options).
If the new method generally selects more registers for LSR solutions, I'll need to fix that.

Hi Evgeny,
Has the fix for hmmer been committed? If not, we need that committed as a first step. Once that's complete, we can reevaluate the performance on AArch64. I'd like to proceed by determining the specific cause of each regression and addressing each accordingly.

bmakam added a comment.EditedMar 7 2017, 8:59 AM

Regarding the other regressions in spec2006:
The new method does not guarantee a perfect solution, so I think it would be fair to apply it if it generally produces better code.
By generally, I mean that the sum of all LSR solution registers (say, across a benchmark) becomes lower.
I can collect such statistics for your arch (please provide me with the exact options).
If the new method generally selects more registers for LSR solutions, I'll need to fix that.

The options we use to test are:

  1. -O3 -fno-strict-aliasing -mcpu=kryo -fomit-frame-pointer
  2. -Ofast -flto -fuse-ld=gold -mcpu=kryo -fomit-frame-pointer

In both configs we see regressions with this patch. Even after applying D30552, which almost fixes spec2006/hmmer, we see a net loss of 0.4% geomean in the LTO config.
spec2000/gzip and spec2006/hmmer (viterbi algorithm) seem to be the motivating benchmarks for this change, but instead of an improvement, the performance is either neutral or negative.