This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
lib/Target/AArch64/
-
Target/
-
AArch64/
-
AArch64ISelLowering.cpp

Differential D15792

[AArch64] Turn off PredictableSelectIsExpensive on the Cortex-A57
AbandonedPublic

Authored by flyingforyou on Dec 27 2015, 5:54 PM.

Download Raw Diff

Details

Reviewers

jmolloy

Summary

There is some significant improvement in commertial benchmark(~10%).
No regression in spec2000, 2006, test-suite.
Test Env: Juno, Cortex-A57 1.1Ghz
Base Rev: r256406

Diff Detail

Event Timeline

flyingforyou updated this revision to Diff 43669.Dec 27 2015, 5:54 PM

flyingforyou retitled this revision from to [AArch64] Turn off PredictableSelectIsExpensive on the Cortex-A57.

flyingforyou updated this object.

flyingforyou added a reviewer: jmolloy.

flyingforyou added a subscriber: llvm-commits.

Herald added subscribers: rengolin, aemerson. · View Herald TranscriptDec 27 2015, 5:54 PM

For more clarity, attached row datas.
I ran some benchmarks against http://reviews.llvm.org/differential/diff/43669/. (Base Revision: r256406)
Env: Juno, Cortex-A57 1.1Ghz
Test: Execute each benchmark 4 times and get average.

in Spec2000, 2006 :

Program	Opt/Ori
External/SPEC/CFP2000/177.mesa/177.mesa	100.83%
External/SPEC/CFP2000/179.art/179.art	97.09%
External/SPEC/CFP2000/183.equake/183.equake	99.97%
External/SPEC/CFP2000/188.ammp/188.ammp	100.99%
External/SPEC/CFP2006/433.milc/433.milc	100.11%
External/SPEC/CFP2006/444.namd/444.namd	99.64%
External/SPEC/CFP2006/447.dealII/447.dealII	100.71%
External/SPEC/CFP2006/450.soplex/450.soplex	100.00%
External/SPEC/CFP2006/470.lbm/470.lbm	99.61%
External/SPEC/CINT2000/164.gzip/164.gzip	100.22%
External/SPEC/CINT2000/175.vpr/175.vpr	100.63%
External/SPEC/CINT2000/176.gcc/176.gcc	101.28%
External/SPEC/CINT2000/181.mcf/181.mcf	100.55%
External/SPEC/CINT2000/186.crafty/186.crafty	101.02%
External/SPEC/CINT2000/197.parser/197.parser	100.10%
External/SPEC/CINT2000/254.gap/254.gap	100.60%
External/SPEC/CINT2000/255.vortex/255.vortex	100.88%
External/SPEC/CINT2000/256.bzip2/256.bzip2	99.90%
External/SPEC/CINT2000/300.twolf/300.twolf	100.00%
External/SPEC/CINT2006/400.perlbench/400.perlbench	100.26%
External/SPEC/CINT2006/401.bzip2/401.bzip2	99.38%
External/SPEC/CINT2006/403.gcc/403.gcc	101.02%
External/SPEC/CINT2006/429.mcf/429.mcf	100.61%
External/SPEC/CINT2006/445.gobmk/445.gobmk	99.39%
External/SPEC/CINT2006/456.hmmer/456.hmmer	98.86%
External/SPEC/CINT2006/458.sjeng/458.sjeng	100.83%
External/SPEC/CINT2006/462.libquantum/462.libquantum	100.59%
External/SPEC/CINT2006/464.h264ref/464.h264ref	100.04%
External/SPEC/CINT2006/471.omnetpp/471.omnetpp	100.79%
External/SPEC/CINT2006/473.astar/473.astar	100.27%
External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk	100.00%
Geomean	100.20%

in test-suite (225 benchmarks) :

Geomean

99.94%

Junmo.

Sorry for everyone. spec & test-suite data is not correct.
I forgot insert option -mcpu=cortex-a57.

I will run test-suite, spec again and share the result soon.
If someone can do test with option mcpu=cortex-a57, please share the result for cross-check.

Junmo.

Second try.

I ran some benchmarks against http://reviews.llvm.org/differential/diff/43669/. (Base Revision: r256477)
Env: Juno, Cortex-A57 1.1Ghz
Test: Execute each benchmark 1 time. with -mcpu=cortex-a57, O3

in Spec2000, 2006 : Lower is better.

Program	Opt/Ori
External/SPEC/CFP2000/177.mesa/177.mesa	91.34%
External/SPEC/CFP2000/179.art/179.art	88.15%
External/SPEC/CFP2000/183.equake/183.equake	102.84%
External/SPEC/CFP2000/188.ammp/188.ammp	91.68%
External/SPEC/CFP2006/433.milc/433.milc	99.35%
External/SPEC/CFP2006/444.namd/444.namd	97.36%
External/SPEC/CFP2006/447.dealII/447.dealII	98.86%
External/SPEC/CFP2006/450.soplex/450.soplex	75.00%
External/SPEC/CFP2006/470.lbm/470.lbm	101.29%
External/SPEC/CINT2000/164.gzip/164.gzip	96.93%
External/SPEC/CINT2000/175.vpr/175.vpr	99.91%
External/SPEC/CINT2000/176.gcc/176.gcc	102.28%
External/SPEC/CINT2000/181.mcf/181.mcf	100.50%
External/SPEC/CINT2000/186.crafty/186.crafty	100.70%
External/SPEC/CINT2000/197.parser/197.parser	99.60%
External/SPEC/CINT2000/254.gap/254.gap	98.54%
External/SPEC/CINT2000/255.vortex/255.vortex	99.61%
External/SPEC/CINT2000/256.bzip2/256.bzip2	98.49%
External/SPEC/CINT2000/300.twolf/300.twolf	99.22%
External/SPEC/CINT2006/400.perlbench/400.perlbench	97.61%
External/SPEC/CINT2006/401.bzip2/401.bzip2	98.13%
External/SPEC/CINT2006/403.gcc/403.gcc	100.86%
External/SPEC/CINT2006/429.mcf/429.mcf	99.22%
External/SPEC/CINT2006/445.gobmk/445.gobmk	95.12%
External/SPEC/CINT2006/456.hmmer/456.hmmer	100.11%
External/SPEC/CINT2006/458.sjeng/458.sjeng	102.40%
External/SPEC/CINT2006/462.libquantum/462.libquantum	112.90%
External/SPEC/CINT2006/464.h264ref/464.h264ref	108.48%
External/SPEC/CINT2006/471.omnetpp/471.omnetpp	103.28%
External/SPEC/CINT2006/473.astar/473.astar	101.27%
External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk	96.88%
Geometric Mean	98.43%

in test-suite (225 benchmarks) :

Geometric Mean

100.81%

I think this is acceptable for Cortex-A57. I ran commertial benchmark on Galaxy S6. And I also saw some improvements too.

Junmo.

3rd try.

I ran some benchmarks against http://reviews.llvm.org/differential/diff/43669/. (Base Revision: r256477)
Env: GalaxyS6 Dev Board, Cortex-A57 2.1Ghz, MIF 1.5Ghz
Test: Execute each benchmark 4 times & get average. with -mcpu=cortex-a57, O3

in Spec2000, 2006 : Lower is better.

Program	Opt/Ori
External/SPEC/CFP2000/177.mesa/177.mesa	98.81%
External/SPEC/CFP2000/179.art/179.art	104.56%
External/SPEC/CFP2000/183.equake/183.equake	99.89%
External/SPEC/CFP2000/188.ammp/188.ammp	99.75%
External/SPEC/CFP2006/433.milc/433.milc	100.49%
External/SPEC/CFP2006/444.namd/444.namd	100.11%
External/SPEC/CFP2006/447.dealII/447.dealII	99.71%
External/SPEC/CFP2006/450.soplex/450.soplex	100.00%
External/SPEC/CFP2006/470.lbm/470.lbm	99.60%
External/SPEC/CINT2000/164.gzip/164.gzip	100.02%
External/SPEC/CINT2000/175.vpr/175.vpr	100.79%
External/SPEC/CINT2000/176.gcc/176.gcc	100.14%
External/SPEC/CINT2000/181.mcf/181.mcf	100.59%
External/SPEC/CINT2000/186.crafty/186.crafty	99.97%
External/SPEC/CINT2000/197.parser/197.parser	100.00%
External/SPEC/CINT2000/254.gap/254.gap	100.21%
External/SPEC/CINT2000/255.vortex/255.vortex	99.67%
External/SPEC/CINT2000/256.bzip2/256.bzip2	96.01%
External/SPEC/CINT2000/300.twolf/300.twolf	99.91%
External/SPEC/CINT2006/400.perlbench/400.perlbench	98.75%
External/SPEC/CINT2006/401.bzip2/401.bzip2	99.56%
External/SPEC/CINT2006/403.gcc/403.gcc	100.18%
External/SPEC/CINT2006/429.mcf/429.mcf	100.31%
External/SPEC/CINT2006/445.gobmk/445.gobmk	100.00%
External/SPEC/CINT2006/456.hmmer/456.hmmer	101.52%
External/SPEC/CINT2006/458.sjeng/458.sjeng	101.29%
External/SPEC/CINT2006/462.libquantum/462.libquantum	100.00%
External/SPEC/CINT2006/464.h264ref/464.h264ref	99.85%
External/SPEC/CINT2006/471.omnetpp/471.omnetpp	98.85%
External/SPEC/CINT2006/473.astar/473.astar	99.88%
External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk	95.52%
Geometric Mean	99.86%

in test-suite (223 benchmarks) :

Geometric Mean

99.90%

This result also shows the patch may be acceptable.

Junmo.

AFAIK, the test suite runs the test or the train loads of these SPEC benchmarks, but they are not representative of the official ref loads. I prefer to see such a change confirmed by the ref loads.

junbuml added a subscriber: junbuml.Jan 5 2016, 12:06 PM

Hi Evandro.

The External directory contains Makefiles for building code that is external to (i.e., not distributed with) LLVM. The most prominent members of this directory are the SPEC 95 and SPEC 2000 benchmark suites. The External directory does not contain these actual tests, but only the Makefiles that know how to properly compile these programs from somewhere else. The presence and location of these external programs is configured by the test-suite configure script.

I can't find the information about ref loads. Before running SPEC again, I want to get confirmation which is correct from @jmolloy.

James, could you review this commit please?

mssimpso added a subscriber: mssimpso.Jan 6 2016, 4:39 AM

Hi,

Yes, SPEC has three workload types for every benchmark: "test", "train" and "ref". "ref" is the longest-running workload and is the only one that is useful in performance analysis. "test" and "train" often execute different codepaths within the benchmark too, so results are not directly comparable.

That said, I have some concerns about the data as given:

One repeat for SPEC is just not enough. The Samsung S6 numbers look substantially less variable than the Juno numbers, and it's difficult to see what is causing that when the repeat counts of each are different.

Soplex has a large improvement (25%) which I happen to know is related to imperfect branch prediction. I think this should be discounted from the results for the purposes of comparison (there is an argument that increasing the number of selects decreases pressure on the branch predictor, reducing the likeliness of branch prediction oddities however).

Libquantum, which is a really good test of predication versus branches, has regressed by 12% on Juno and yet is utterly unchanged on S6. I'm unconvinced by this.

h264, which is one of the most real-world workloads for the mobile space in SPEC has regressed by 8% on Juno.

S6 almost looks like complete noise - there are no significant changes anywhere which I have difficulty believing. The largest swing is a 4% regression in 179.art, while Juno has a 12% improvement. These do not correlate.

In fact, the pearson's correlation coefficient between the two datasets is -0.09, which is low enough to assume no correlation (even slight negative correlation!), so I distrust the numbers.

Sorry,

James

This revision now requires changes to proceed.Jan 6 2016, 6:01 AM

Thanks for confirmation James.

Evandro, you're right. As you said, when we run spec2000,2006 default RUN_TYPE in test-suite is not ref.
When I looked https://www.spec.org/cpu2000/docs/runspec.html#toc_I.C. , I found ref is default setting.

I will update data again.

James, I still think this commit might be worth for Cortex-A57. Could you give me a second chance, please?

Junmo.

In D15792#321176, @flyingforyou wrote:

James, I still think this commit might be worth for Cortex-A57. Could you give me a second chance, please?

The status is "needs revision", not "refused". :)

If you get new numbers that prove this is a good move, you should definitely try again.

The hint is to do yourself the analysis that James did. Look at the numbers with care, make sure that they're consistent and if they're not, investigate why not, and only publicise your findings when you're sure that every corner was swept and you have an answer to all issues with the data.

In a nutshell, always distrust your own numbers until you prove yourself that they're good.

cheers,
--renato

As we are on the topic of SPEC and test-suite: The SPEC running Makefiles in the test-suite do not come close to proper SPEC runs. They only ever run 1 input while SPEC often supplies multiple different inputs. They are good as “smoke” tests but do not necessarily reflect the changes in a real ref run.
If you need spec-ref numbers then I would recommend to use the runspec tool that comes with SPEC or the test-suite/LNTBased stuff (lnt runtest nt —spec-with-ref I believe). I’m also currently integrating proper SPEC support into the new cmake/lit style of the test-suite which already properly handle SPEC95 and SPEC2000 but the SPEC2006 part isn’t finished yet.

Matthias

@MatzeB, the last time I tried, the LNT failed to the ref workloads due to missing the proper hashes for the results. Also, the definitions seemed to be frozen to support just x86. Has either of these observations been addressed?

Thank you.

Junmo,

When stating SPEC results, it's better to stick to the numbers spitted by its framework. Let me know if you need help with a proper config file.

Cheers.

As far as I know, there is a mixup in reference results (rdar://23577619 rdar://23577619 for apple people). About the state of the LNTBased/—with-spec-ref you best ask Adam (CCed now).

Matthias
- msg-13858-114.txt162 BDownload

@evandro, did you run LNT with --spec-with-ref? I've been using it regularly both natively on x86 and remotely with ARM64.

@anemet, I must say that it was a few months ago and, after an attempting to figure out what was wrong and realizing that it was more than I wanted to deal with, I fell back to using the official framework. Sounds like it's worth revisiting it again, especially after stumbling at http://reviews.llvm.org/rL255876

Thank you.

In D15792#322307, @evandro wrote:

@anemet, I must say that it was a few months ago and, after an attempting to figure out what was wrong and realizing that it was more than I wanted to deal with, I fell back to using the official framework. Sounds like it's worth revisiting it again, especially after stumbling at http://reviews.llvm.org/rL255876

Well, that patch is actually unrelated to --spec-with-ref at this point. --spec-with-ref currently uses the old makefile-based framework for building SPEC and then rolls its own run command for each SPEC benchmark (http://reviews.llvm.org/D14735). But yes, long term the idea is to converge these two and use cmake with --spec-with-ref. BTW, --spec-with-ref support SPEC 2000/2006 at this point.

Hi James.

I ran SPEC with ref input as Evandro & you said. After running SPEC with ref input, I can see same result what you saw.

[AArch64] Conditional selects are expensive on out-of-order cores.
Specifically Cortex-A57. This probably applies to Cyclone too but I haven't enabled it for that as I can't test it.
This gives ~4% improvement on SPEC 174.vpr, and ~1% in 471.omnetpp.

But bzip2 which is in SPEC INT 2000/2006 and 456.hmmer show 2~3% improvemnt when I set PredictableSelectIsExpensive false;

On commertial benchmark, we can see big improvements on a few workloads. (8~12% on GS6 Dev board which uses Cortex-A57)

When below code compiled with -O3 -ffast-math -mcpu=cortex-a57 option, optimizeSelectInst(CodeGenPrepare.cpp) can be run due to isPredictableSelectIsExpensive flag set true.

float tmp1 = sqrt(tmp);
tmp4 = tmp1 == 0 ? 1 : 1 / tmp3;

Above code translated likes below.

 0:   1e21c000        fsqrt   s0, s0
 4:   1e202008        fcmp    s0, #0.0
 8:   54000061        b.ne    14 <_Z4testfff+0x14>
 c:   1e2e1000        fmov    s0, #1.000000000000000000e+00
10:   d65f03c0        ret
14:   1e2e1000        fmov    s0, #1.000000000000000000e+00
18:   1e211800        fdiv    s0, s0, s1
1c:   d65f03c0        ret

if "tmp1 == 0" always or mostly false, we should run division "fdiv s0, s1, s0".

In this case, two branches which are for select optimization can be a performance bottleneck.

I remembered that you mentioned about PGO, when I upload some flag tuning.

I think if we know PGO information, this can be handled easily.

Can we just let this flag off on "cortex-a57"? Without PGO information, it can't cover above case.

Junmo.

Abandon this patch due to SPEC 174.vpr's regression.

Revision Contents

Path

Size

lib/

Target/

AArch64/

AArch64ISelLowering.cpp

4 lines

Diff 43669

lib/Target/AArch64/AArch64ISelLowering.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 616 Lines • ▼ Show 20 Lines	for (MVT Ty : {MVT::v2f32, MVT::v4f32, MVT::v2f64}) {
setOperationAction(ISD::FFLOOR, Ty, Legal);		setOperationAction(ISD::FFLOOR, Ty, Legal);
setOperationAction(ISD::FNEARBYINT, Ty, Legal);		setOperationAction(ISD::FNEARBYINT, Ty, Legal);
setOperationAction(ISD::FCEIL, Ty, Legal);		setOperationAction(ISD::FCEIL, Ty, Legal);
setOperationAction(ISD::FRINT, Ty, Legal);		setOperationAction(ISD::FRINT, Ty, Legal);
setOperationAction(ISD::FTRUNC, Ty, Legal);		setOperationAction(ISD::FTRUNC, Ty, Legal);
setOperationAction(ISD::FROUND, Ty, Legal);		setOperationAction(ISD::FROUND, Ty, Legal);
}		}
}		}

// Prefer likely predicted branches to selects on out-of-order cores.
if (Subtarget->isCortexA57())
PredictableSelectIsExpensive = true;
}		}

void AArch64TargetLowering::addTypeForNEON(EVT VT, EVT PromotedBitwiseVT) {		void AArch64TargetLowering::addTypeForNEON(EVT VT, EVT PromotedBitwiseVT) {
if (VT == MVT::v2f32 \|\| VT == MVT::v4f16) {		if (VT == MVT::v2f32 \|\| VT == MVT::v4f16) {
setOperationAction(ISD::LOAD, VT.getSimpleVT(), Promote);		setOperationAction(ISD::LOAD, VT.getSimpleVT(), Promote);
AddPromotedToType(ISD::LOAD, VT.getSimpleVT(), MVT::v2i32);		AddPromotedToType(ISD::LOAD, VT.getSimpleVT(), MVT::v2i32);

setOperationAction(ISD::STORE, VT.getSimpleVT(), Promote);		setOperationAction(ISD::STORE, VT.getSimpleVT(), Promote);
▲ Show 20 Lines • Show All 9,428 Lines • Show Last 20 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Turn off PredictableSelectIsExpensive on the Cortex-A57AbandonedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 43669

lib/Target/AArch64/AArch64ISelLowering.cpp

[AArch64] Turn off PredictableSelectIsExpensive on the Cortex-A57
AbandonedPublic