This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Turn off PredictableSelectIsExpensive on the Cortex-A57
AbandonedPublic

Authored by flyingforyou on Dec 27 2015, 5:54 PM.

Details

Reviewers
jmolloy
Summary

There is some significant improvement in commertial benchmark(~10%).
No regression in spec2000, 2006, test-suite.
Test Env: Juno, Cortex-A57 1.1Ghz
Base Rev: r256406

Diff Detail

Event Timeline

flyingforyou retitled this revision from to [AArch64] Turn off PredictableSelectIsExpensive on the Cortex-A57.
flyingforyou updated this object.
flyingforyou added a reviewer: jmolloy.
flyingforyou added a subscriber: llvm-commits.

For more clarity, attached row datas.
I ran some benchmarks against http://reviews.llvm.org/differential/diff/43669/. (Base Revision: r256406)
Env: Juno, Cortex-A57 1.1Ghz
Test: Execute each benchmark 4 times and get average.

in Spec2000, 2006 :

ProgramOpt/Ori
External/SPEC/CFP2000/177.mesa/177.mesa100.83%
External/SPEC/CFP2000/179.art/179.art97.09%
External/SPEC/CFP2000/183.equake/183.equake99.97%
External/SPEC/CFP2000/188.ammp/188.ammp100.99%
External/SPEC/CFP2006/433.milc/433.milc100.11%
External/SPEC/CFP2006/444.namd/444.namd99.64%
External/SPEC/CFP2006/447.dealII/447.dealII100.71%
External/SPEC/CFP2006/450.soplex/450.soplex100.00%
External/SPEC/CFP2006/470.lbm/470.lbm99.61%
External/SPEC/CINT2000/164.gzip/164.gzip100.22%
External/SPEC/CINT2000/175.vpr/175.vpr100.63%
External/SPEC/CINT2000/176.gcc/176.gcc101.28%
External/SPEC/CINT2000/181.mcf/181.mcf100.55%
External/SPEC/CINT2000/186.crafty/186.crafty101.02%
External/SPEC/CINT2000/197.parser/197.parser100.10%
External/SPEC/CINT2000/254.gap/254.gap100.60%
External/SPEC/CINT2000/255.vortex/255.vortex100.88%
External/SPEC/CINT2000/256.bzip2/256.bzip299.90%
External/SPEC/CINT2000/300.twolf/300.twolf100.00%
External/SPEC/CINT2006/400.perlbench/400.perlbench100.26%
External/SPEC/CINT2006/401.bzip2/401.bzip299.38%
External/SPEC/CINT2006/403.gcc/403.gcc101.02%
External/SPEC/CINT2006/429.mcf/429.mcf100.61%
External/SPEC/CINT2006/445.gobmk/445.gobmk99.39%
External/SPEC/CINT2006/456.hmmer/456.hmmer98.86%
External/SPEC/CINT2006/458.sjeng/458.sjeng100.83%
External/SPEC/CINT2006/462.libquantum/462.libquantum100.59%
External/SPEC/CINT2006/464.h264ref/464.h264ref100.04%
External/SPEC/CINT2006/471.omnetpp/471.omnetpp100.79%
External/SPEC/CINT2006/473.astar/473.astar100.27%
External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk100.00%
Geomean100.20%

in test-suite (225 benchmarks) :

Geomean99.94%

Junmo.

Sorry for everyone. spec & test-suite data is not correct.
I forgot insert option -mcpu=cortex-a57.

I will run test-suite, spec again and share the result soon.
If someone can do test with option mcpu=cortex-a57, please share the result for cross-check.

Junmo.

Second try.

I ran some benchmarks against http://reviews.llvm.org/differential/diff/43669/. (Base Revision: r256477)
Env: Juno, Cortex-A57 1.1Ghz
Test: Execute each benchmark 1 time. with -mcpu=cortex-a57, O3

in Spec2000, 2006 : Lower is better.

ProgramOpt/Ori
External/SPEC/CFP2000/177.mesa/177.mesa91.34%
External/SPEC/CFP2000/179.art/179.art88.15%
External/SPEC/CFP2000/183.equake/183.equake102.84%
External/SPEC/CFP2000/188.ammp/188.ammp91.68%
External/SPEC/CFP2006/433.milc/433.milc99.35%
External/SPEC/CFP2006/444.namd/444.namd97.36%
External/SPEC/CFP2006/447.dealII/447.dealII98.86%
External/SPEC/CFP2006/450.soplex/450.soplex75.00%
External/SPEC/CFP2006/470.lbm/470.lbm101.29%
External/SPEC/CINT2000/164.gzip/164.gzip96.93%
External/SPEC/CINT2000/175.vpr/175.vpr99.91%
External/SPEC/CINT2000/176.gcc/176.gcc102.28%
External/SPEC/CINT2000/181.mcf/181.mcf100.50%
External/SPEC/CINT2000/186.crafty/186.crafty100.70%
External/SPEC/CINT2000/197.parser/197.parser99.60%
External/SPEC/CINT2000/254.gap/254.gap98.54%
External/SPEC/CINT2000/255.vortex/255.vortex99.61%
External/SPEC/CINT2000/256.bzip2/256.bzip298.49%
External/SPEC/CINT2000/300.twolf/300.twolf99.22%
External/SPEC/CINT2006/400.perlbench/400.perlbench97.61%
External/SPEC/CINT2006/401.bzip2/401.bzip298.13%
External/SPEC/CINT2006/403.gcc/403.gcc100.86%
External/SPEC/CINT2006/429.mcf/429.mcf99.22%
External/SPEC/CINT2006/445.gobmk/445.gobmk95.12%
External/SPEC/CINT2006/456.hmmer/456.hmmer100.11%
External/SPEC/CINT2006/458.sjeng/458.sjeng102.40%
External/SPEC/CINT2006/462.libquantum/462.libquantum112.90%
External/SPEC/CINT2006/464.h264ref/464.h264ref108.48%
External/SPEC/CINT2006/471.omnetpp/471.omnetpp103.28%
External/SPEC/CINT2006/473.astar/473.astar101.27%
External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk96.88%
Geometric Mean98.43%

in test-suite (225 benchmarks) :

Geometric Mean100.81%

I think this is acceptable for Cortex-A57. I ran commertial benchmark on Galaxy S6. And I also saw some improvements too.

Junmo.

3rd try.

I ran some benchmarks against http://reviews.llvm.org/differential/diff/43669/. (Base Revision: r256477)
Env: GalaxyS6 Dev Board, Cortex-A57 2.1Ghz, MIF 1.5Ghz
Test: Execute each benchmark 4 times & get average. with -mcpu=cortex-a57, O3

in Spec2000, 2006 : Lower is better.

ProgramOpt/Ori
External/SPEC/CFP2000/177.mesa/177.mesa98.81%
External/SPEC/CFP2000/179.art/179.art104.56%
External/SPEC/CFP2000/183.equake/183.equake99.89%
External/SPEC/CFP2000/188.ammp/188.ammp99.75%
External/SPEC/CFP2006/433.milc/433.milc100.49%
External/SPEC/CFP2006/444.namd/444.namd100.11%
External/SPEC/CFP2006/447.dealII/447.dealII99.71%
External/SPEC/CFP2006/450.soplex/450.soplex100.00%
External/SPEC/CFP2006/470.lbm/470.lbm99.60%
External/SPEC/CINT2000/164.gzip/164.gzip100.02%
External/SPEC/CINT2000/175.vpr/175.vpr100.79%
External/SPEC/CINT2000/176.gcc/176.gcc100.14%
External/SPEC/CINT2000/181.mcf/181.mcf100.59%
External/SPEC/CINT2000/186.crafty/186.crafty99.97%
External/SPEC/CINT2000/197.parser/197.parser100.00%
External/SPEC/CINT2000/254.gap/254.gap100.21%
External/SPEC/CINT2000/255.vortex/255.vortex99.67%
External/SPEC/CINT2000/256.bzip2/256.bzip296.01%
External/SPEC/CINT2000/300.twolf/300.twolf99.91%
External/SPEC/CINT2006/400.perlbench/400.perlbench98.75%
External/SPEC/CINT2006/401.bzip2/401.bzip299.56%
External/SPEC/CINT2006/403.gcc/403.gcc100.18%
External/SPEC/CINT2006/429.mcf/429.mcf100.31%
External/SPEC/CINT2006/445.gobmk/445.gobmk100.00%
External/SPEC/CINT2006/456.hmmer/456.hmmer101.52%
External/SPEC/CINT2006/458.sjeng/458.sjeng101.29%
External/SPEC/CINT2006/462.libquantum/462.libquantum100.00%
External/SPEC/CINT2006/464.h264ref/464.h264ref99.85%
External/SPEC/CINT2006/471.omnetpp/471.omnetpp98.85%
External/SPEC/CINT2006/473.astar/473.astar99.88%
External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk95.52%
Geometric Mean99.86%

in test-suite (223 benchmarks) :

Geometric Mean99.90%

This result also shows the patch may be acceptable.

Junmo.

evandro added a subscriber: evandro.Jan 5 2016, 9:30 AM

AFAIK, the test suite runs the test or the train loads of these SPEC benchmarks, but they are not representative of the official ref loads. I prefer to see such a change confirmed by the ref loads.

Hi Evandro.

The External directory contains Makefiles for building code that is external to (i.e., not distributed with) LLVM. The most prominent members of this directory are the SPEC 95 and SPEC 2000 benchmark suites. The External directory does not contain these actual tests, but only the Makefiles that know how to properly compile these programs from somewhere else. The presence and location of these external programs is configured by the test-suite configure script.

I can't find the information about ref loads. Before running SPEC again, I want to get confirmation which is correct from @jmolloy.

James, could you review this commit please?

jmolloy requested changes to this revision.Jan 6 2016, 6:01 AM
jmolloy edited edge metadata.

Hi,

Yes, SPEC has three workload types for every benchmark: "test", "train" and "ref". "ref" is the longest-running workload and is the only one that is useful in performance analysis. "test" and "train" often execute different codepaths within the benchmark too, so results are not directly comparable.

That said, I have some concerns about the data as given:

  1. One repeat for SPEC is just not enough. The Samsung S6 numbers look substantially less variable than the Juno numbers, and it's difficult to see what is causing that when the repeat counts of each are different.
  1. Soplex has a large improvement (25%) which I happen to know is related to imperfect branch prediction. I think this should be discounted from the results for the purposes of comparison (there is an argument that increasing the number of selects decreases pressure on the branch predictor, reducing the likeliness of branch prediction oddities however).
  1. Libquantum, which is a really good test of predication versus branches, has regressed by 12% on Juno and yet is utterly unchanged on S6. I'm unconvinced by this.
  1. h264, which is one of the most real-world workloads for the mobile space in SPEC has regressed by 8% on Juno.
  1. S6 almost looks like complete noise - there are no significant changes anywhere which I have difficulty believing. The largest swing is a 4% regression in 179.art, while Juno has a 12% improvement. These do not correlate.
  1. In fact, the pearson's correlation coefficient between the two datasets is -0.09, which is low enough to assume no correlation (even slight negative correlation!), so I distrust the numbers.

Sorry,

James

This revision now requires changes to proceed.Jan 6 2016, 6:01 AM

Thanks for confirmation James.

Evandro, you're right. As you said, when we run spec2000,2006 default RUN_TYPE in test-suite is not ref.
When I looked https://www.spec.org/cpu2000/docs/runspec.html#toc_I.C. , I found ref is default setting.

I will update data again.

James, I still think this commit might be worth for Cortex-A57. Could you give me a second chance, please?

Junmo.

James, I still think this commit might be worth for Cortex-A57. Could you give me a second chance, please?

The status is "needs revision", not "refused". :)

If you get new numbers that prove this is a good move, you should definitely try again.

The hint is to do yourself the analysis that James did. Look at the numbers with care, make sure that they're consistent and if they're not, investigate why not, and only publicise your findings when you're sure that every corner was swept and you have an answer to all issues with the data.

In a nutshell, always distrust your own numbers until you prove yourself that they're good.

cheers,
--renato

MatzeB added a subscriber: MatzeB.Jan 7 2016, 12:56 PM

As we are on the topic of SPEC and test-suite: The SPEC running Makefiles in the test-suite do not come close to proper SPEC runs. They only ever run 1 input while SPEC often supplies multiple different inputs. They are good as “smoke” tests but do not necessarily reflect the changes in a real ref run.
If you need spec-ref numbers then I would recommend to use the runspec tool that comes with SPEC or the test-suite/LNTBased stuff (lnt runtest nt —spec-with-ref I believe). I’m also currently integrating proper SPEC support into the new cmake/lit style of the test-suite which already properly handle SPEC95 and SPEC2000 but the SPEC2006 part isn’t finished yet.

  • Matthias

@MatzeB, the last time I tried, the LNT failed to the ref workloads due to missing the proper hashes for the results. Also, the definitions seemed to be frozen to support just x86. Has either of these observations been addressed?

Thank you.

Junmo,

When stating SPEC results, it's better to stick to the numbers spitted by its framework. Let me know if you need help with a proper config file.

Cheers.

As far as I know, there is a mixup in reference results (rdar://23577619 rdar://23577619 for apple people). About the state of the LNTBased/—with-spec-ref you best ask Adam (CCed now).

  • Matthias
anemet added a subscriber: anemet.Jan 7 2016, 2:23 PM

@evandro, did you run LNT with --spec-with-ref? I've been using it regularly both natively on x86 and remotely with ARM64.

@anemet, I must say that it was a few months ago and, after an attempting to figure out what was wrong and realizing that it was more than I wanted to deal with, I fell back to using the official framework. Sounds like it's worth revisiting it again, especially after stumbling at http://reviews.llvm.org/rL255876

Thank you.

anemet added a comment.Jan 8 2016, 9:49 AM

@anemet, I must say that it was a few months ago and, after an attempting to figure out what was wrong and realizing that it was more than I wanted to deal with, I fell back to using the official framework. Sounds like it's worth revisiting it again, especially after stumbling at http://reviews.llvm.org/rL255876

Well, that patch is actually unrelated to --spec-with-ref at this point. --spec-with-ref currently uses the old makefile-based framework for building SPEC and then rolls its own run command for each SPEC benchmark (http://reviews.llvm.org/D14735). But yes, long term the idea is to converge these two and use cmake with --spec-with-ref. BTW, --spec-with-ref support SPEC 2000/2006 at this point.

Hi James.

I ran SPEC with ref input as Evandro & you said. After running SPEC with ref input, I can see same result what you saw.

[AArch64] Conditional selects are expensive on out-of-order cores.
Specifically Cortex-A57. This probably applies to Cyclone too but I haven't enabled it for that as I can't test it.
This gives ~4% improvement on SPEC 174.vpr, and ~1% in 471.omnetpp.

But bzip2 which is in SPEC INT 2000/2006 and 456.hmmer show 2~3% improvemnt when I set PredictableSelectIsExpensive false;

On commertial benchmark, we can see big improvements on a few workloads. (8~12% on GS6 Dev board which uses Cortex-A57)

When below code compiled with -O3 -ffast-math -mcpu=cortex-a57 option, optimizeSelectInst(CodeGenPrepare.cpp) can be run due to isPredictableSelectIsExpensive flag set true.

float tmp1 = sqrt(tmp);
tmp4 = tmp1 == 0 ? 1 : 1 / tmp3;

Above code translated likes below.

 0:   1e21c000        fsqrt   s0, s0
 4:   1e202008        fcmp    s0, #0.0
 8:   54000061        b.ne    14 <_Z4testfff+0x14>
 c:   1e2e1000        fmov    s0, #1.000000000000000000e+00
10:   d65f03c0        ret
14:   1e2e1000        fmov    s0, #1.000000000000000000e+00
18:   1e211800        fdiv    s0, s0, s1
1c:   d65f03c0        ret

if "tmp1 == 0" always or mostly false, we should run division "fdiv s0, s1, s0".

In this case, two branches which are for select optimization can be a performance bottleneck.

I remembered that you mentioned about PGO, when I upload some flag tuning.

I think if we know PGO information, this can be handled easily.

Can we just let this flag off on "cortex-a57"? Without PGO information, it can't cover above case.

Junmo.

flyingforyou abandoned this revision.Feb 2 2016, 5:36 AM

Abandon this patch due to SPEC 174.vpr's regression.