This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Turn on by default interleaved access vectorization
ClosedPublic

Authored by sbaranga on Aug 19 2015, 6:24 AM.

Details

Summary

This change turns on interleaved access vectorization by default
for AArch64.

We also clean up some tests which were specifically enabling this
behaviour.
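For readers unfamiliar with the feature, here is a minimal sketch (not taken from the patch; the function and names are illustrative) of the kind of loop this transformation targets: stride-2 memory accesses that the loop vectorizer can group into an interleaved access, which the AArch64 backend can then lower to a structured load such as ld2 instead of separate loads and shuffles.

```c
#include <stddef.h>

/* De-interleave a stereo sample buffer into separate channels.
   The loads in[2*i] and in[2*i+1] form an interleave group with
   factor 2; with interleaved access vectorization enabled, the
   vectorizer emits one wide load plus shuffles, which the AArch64
   backend can lower to an ld2 structured load. */
void deinterleave(const short *in, short *left, short *right, size_t n) {
    for (size_t i = 0; i < n; i++) {
        left[i]  = in[2 * i];
        right[i] = in[2 * i + 1];
    }
}
```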

Diff Detail

Event Timeline

sbaranga updated this revision to Diff 32543.Aug 19 2015, 6:24 AM
sbaranga retitled this revision from to [AArch64] Turn on by default interleaved access vectorization.
sbaranga updated this object.
sbaranga added a subscriber: llvm-commits.

Tested with lnt, spec2000, and some other internal benchmarks (same as on ARM).

Performance Regressions - Execution Time
lnt.MultiSource/Applications/hexxagon/hexxagon 5.95%
lnt.MultiSource/Benchmarks/Olden/bh/bh 5.07%
lnt.SingleSource/Benchmarks/Shootout-C++/lists 3.02%
lnt.MultiSource/Applications/sqlite3/sqlite3 2.18%
lnt.MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm 1.59%
lnt.MultiSource/Benchmarks/MiBench/telecomm-CRC32/telecomm-CRC32 1.54%
lnt.MultiSource/Benchmarks/BitBench/five11/five11 1.20%
lnt.MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl 1.19%

Performance Improvements - Execution Time
lnt.MultiSource/Benchmarks/PAQ8p/paq8p -16.27%
lnt.MultiSource/Benchmarks/VersaBench/bmm/bmm -3.78%
lnt.MultiSource/Benchmarks/BitBench/uudecode/uudecode -3.55%
lnt.SingleSource/UnitTests/Vectorizer/gcc-loops -3.31%
lnt.MultiSource/Benchmarks/ASC_Sequoia/IRSmk/IRSmk -2.30%
lnt.MultiSource/Benchmarks/Olden/perimeter/perimeter -2.07%
lnt.SingleSource/Benchmarks/Polybench/medley/floyd-warshall/floyd-warshall -1.68%
lnt.SingleSource/Benchmarks/BenchmarkGame/puzzle -1.20%

Again, no major change in lnt, and the spec scores seem unaffected. As on ARM, I've seen improvements in other benchmarks.

rengolin accepted this revision.Aug 19 2015, 7:12 AM
rengolin added a reviewer: rengolin.

LNT is not famous for being accurate. :) And as far as I know, it's not exercising strided access that much, if at all.

LGTM. Thanks!

This revision is now accepted and ready to land.Aug 19 2015, 7:12 AM

I’m not sure about this LG and have a number of questions:

  1. Has the review of the actual interleave code been finished?
  2. What is the compile-time impact?
  3. Could you share detailed performance data on SPEC ref input (per benchmark) and perhaps some other suites you run on a regular basis?
  4. Could you count the number of times interleaved vectorization triggers per benchmark and see if there is a correlation to the run-time data you measure?
  5. Do you expect an impact on other architectures (not just ARM, ARM64, etc.)? Data?

Thanks
Gerolf

I’m not sure about this LG and have a number of questions:

  1. Has the review of the actual interleave code been finished?

Yes, all stride / interleaved access for ARM and AArch64 have been reviewed and committed.

  2. What is the compile-time impact?

AFAIK unnoticeable. The validation phase bails out pretty quickly when strides are not possible, just like everything else.

  3. Could you share detailed performance data on SPEC ref input (per benchmark) and perhaps some other suites you run on a regular basis?

That's easier said than done. SPEC and other benchmark licences are silly in that you never know how much sharing is too much, until you pass that threshold.

But one thing is for sure, no one shares "detailed performance data". Ever.

In this specific case, Silviu hasn't shared any SPEC results simply because they have not changed with any statistical significance, and that's thoroughly expected, since there aren't many cases of stride vectorization opportunities in SPEC. There are, however, such opportunities in other benchmarks, which they did run, and in which they have seen improvements (sorry, I can't say more than that).

It is in the interest of ARM to do as much benchmarking as possible and to be *very* accurate and responsible about it, including compile time, so I trust their investigation quality. That's why it looks good to me.

  4. Could you count the number of times interleaved vectorization triggers per benchmark and see if there is a correlation to the run-time data you measure?

LNT has some, SPEC has close to none, others rely heavily on it. To be honest, the numbers are pretty much what I expected.

  5. Do you expect an impact on other architectures (not just ARM, ARM64, etc.)? Data?

This is just enabled for ARM and AArch64, so no other architecture will ever see this happening. It's up to other people to enable it and customise to their architecture, and certainly not for this patch.

Keep in mind that what Silviu is enabling here is a development version of the stride vectorizer, so we can start tracking performance and fixing the corner cases. Release 3.7 is already branched and release 3.8 is a looong way away, so we'll have plenty of time to fix any issues that come up on ARM and AArch64.

All the other work, including experimental testing of the feature (by turning on stride vectorization with a flag), has been done for weeks now, and all looks well. So it's only natural to move from the experimental to the development stage, and to keep a good number of months between the development and production stages, when 3.8 branches out.

In the unlikely event that stride vectorization causes enough trouble that we can't fix it for 3.8, we'll disable it again by default and release a stable product, but trunk will have it on by default so that people on all sides can find problems with it.

I hope that sheds some light on your doubts.

cheers,
--renato

In addition to Renato's reply:

I’m not sure about this LG and have a number of questions:

  1. Has the review of the actual interleave code been finished?
  2. What is the compile-time impact?
  3. Could you share detailed performance data on SPEC ref input (per benchmark) and perhaps some other suites you run on a regular basis?
  4. Could you count the number of times interleaved vectorization triggers per benchmark and see if there is a correlation to the run-time data you measure?

I'm taking a closer look at SPEC now, but I doubt there will be a strong correlation with run-time data (only if we get lucky and optimize a hot loop). The changes weren't significant either.

  5. Do you expect an impact on other architectures (not just ARM, ARM64, etc.)? Data?

I suspect it wouldn't be beneficial unless the architecture's backend has a way of efficiently lowering the loads + shuffles to a reasonably fast instruction sequence (and this should also be reflected in the cost model). I had to make a number of fixes for ARM/AArch64 to remove the regressions I found, so I wouldn't turn this on elsewhere without data.

Thanks
Gerolf

-Silviu

I suspect it wouldn't be beneficial unless the architecture's backend has a way of efficiently lowering the loads + shuffles to a reasonably fast instruction sequence (and this should also be reflected in the cost model). I had to make a number of fixes for ARM/AArch64 to remove the regressions I found, so I wouldn't turn this on elsewhere without data.

I believe Intel's AVX512 has interleaved access that can be used to profit from strided vectorization, but that's up to the Intel folks to implement, test and benchmark.

Here are the spec2k and spec2k6 results (AArch64, Cortex-A57). There seems to be no significant change. This is probably a combination of workload types and the optimized functions not being 'hot'. The preferred workload here seems to be something like image-processing kernels (which explains why the optimization triggered a lot in the mesa benchmark).
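As a sketch of that workload shape (a hypothetical kernel, not taken from mesa or the test suite; the grayscale weights are assumed for illustration): an image-processing loop over interleaved RGB pixels, whose stride-3 loads form a factor-3 interleave group of the kind AArch64 can lower to an ld3 structured load.

```c
#include <stddef.h>

/* Convert an interleaved RGB8 buffer to grayscale using integer
   weights (assumed coefficients, for illustration only; they sum
   to 256).  The three stride-3 loads per iteration form a factor-3
   interleave group, the pattern this optimization targets. */
void rgb_to_gray(const unsigned char *rgb, unsigned char *gray, size_t npix) {
    for (size_t i = 0; i < npix; i++) {
        unsigned r = rgb[3 * i];
        unsigned g = rgb[3 * i + 1];
        unsigned b = rgb[3 * i + 2];
        gray[i] = (unsigned char)((77 * r + 150 * g + 29 * b) >> 8);
    }
}
```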

SPEC2000

Size:

Name      Change (patched/original - 1)   Binary changed
gzip       0.07%                          Y
vpr        0.08%                          Y
gcc        0.21%                          Y
mesa       0.69%                          Y
art       -0.04%                          Y
mcf        0                              N
equake     0                              N
crafty     0                              Y
ammp       0                              Y
parser     0                              N
eon        0.10%                          Y
perlbmk    0                              Y
gap        0                              N
vortex     0                              N
bzip2      0                              N
twolf      0.01%                          Y

Performance (only results from changed binaries are included).
Negative numbers are improvements; positive numbers are regressions.

Name                           Execution time (patched/original - 1)
spec.cpu2000.ref.164_gzip      -0.25%
spec.cpu2000.ref.175_vpr       -0.55%
spec.cpu2000.ref.176_gcc       -1.25%
spec.cpu2000.ref.177_mesa       0.40%
spec.cpu2000.ref.179_art       -1.04%
spec.cpu2000.ref.186_crafty     0.20%
spec.cpu2000.ref.188_ammp       0.63%
spec.cpu2000.ref.252_eon       -0.48%
spec.cpu2000.ref.253_perlbmk    0.86%
spec.cpu2000.ref.300_twolf     -1.18%

Identified interleaved accesses in loops:

Name      Vectorized with IA   Vectorizable with IA and not profitable
gzip       3                   0
vpr        1                   3
gcc        9                   0
mesa      39                   6
art       11                   2
crafty     5                   0
ammp       1                   6
eon        1                   0
perlbmk    3                   1
twolf      1                   0

SPEC2006

Size:

Name         Change (patched/original - 1)   Vectorized with IA   Vectorizable with IA but not profitable   Binary changed
perlbench     0                                4                    2                                       Y
bzip2         0                                0                    0                                       N
gcc           0                                5                    3                                       Y
mcf           0                                0                    0                                       N
milc          0                                0                    0                                       N
namd          0                               17                    0                                       Y
gobmk         0                                3                    6                                       Y
dealII        0.67%                          232                   65                                       Y
soplex        0                                2                   19                                       Y
povray        0                               19                   15                                       Y
hmmer        -0.01%                            1                    0                                       Y
sjeng         0                                0                    0                                       N
libquantum    0.20%                            1                    2                                       Y
h264ref       0.07%                            3                   10                                       Y
lbm           0                                0                    0                                       N
omnetpp       0                                0                    0                                       N
astar         0                                0                    0                                       N
sphinx3       1.84%                            8                    1                                       Y
xalancbmk     0                                0                    3                                       N

The large number of optimized loops in dealII comes from an STL function getting optimized (the same function essentially gets optimized multiple times).

Performance (only results from changed binaries are included).

Negative numbers are improvements; positive numbers are regressions.

Name                               Patched/Original - 1
spec.cpu2006.ref.400_perlbench      1.49%
spec.cpu2006.ref.403_gcc            0.06%
spec.cpu2006.ref.444_namd           0.23%
spec.cpu2006.ref.445_gobmk          0.03%
spec.cpu2006.ref.447_dealII        -0.57%
spec.cpu2006.ref.450_soplex         0.22%
spec.cpu2006.ref.453_povray        -1.15%
spec.cpu2006.ref.456_hmmer          0.31%
spec.cpu2006.ref.462_libquantum     0.05%
spec.cpu2006.ref.464_h264ref        0.39%
spec.cpu2006.ref.482_sphinx3        1.74%

The sphinx3 result seems to be noise (it went away with further runs).

I'll post some compile-time results later on (probably using a bootstrapped LLVM build).

Thanks,
Silviu

Hi,

I have performed a bootstrap AArch64 build of clang to measure the compile-time impact. The build was done with -j2. The measurement showed a 0.29% higher build time when enabling interleaved access vectorization.

I'll get more data points, but this looks mostly like noise to me; turning this on doesn't appear to have a significant impact on build times.

Given this and the SPEC analysis above, does anyone have any objections to turning this on by default for both ARM and AArch64?

Thanks,
Silviu

No objections, Silviu. Please commit.

Hi Gerolf,

You had some objections to this before. Do you think everything is ok with the latest data?

Thanks,
Silviu

Here are the results for build-time changes (with -j1) per benchmark for spec2k/2k6. The build times for spec2k were much more stable than the ones for spec2k6. The build-time improvements are probably false positives.

SPEC2000:

Name          Change (patch/original - 1)
164.gzip*     -1.01%
175.vpr        0.44%
176.gcc        0.16%
181.mcf*      -2.63%
186.crafty     0.63%
197.parser    -0.22%
252.eon       -0.10%
253.perlbmk   -0.06%
254.gap        0.05%
255.vortex    -0.23%
256.bzip2*    -0.70%
300.twolf     -0.23%
177.mesa       1.29%
179.art*       0.62%
183.equake*    0%
188.ammp       1.04%

SPEC2006:

Name             Change (patch/original - 1)
400.perlbench    -2.02%
401.bzip2        -3.93%
403.gcc          -2.80%
429.mcf*         -7.38%
433.milc         -5.24%
444.namd         -3.70%
445.gobmk        -1.87%
447.dealII       -3.17%
450.soplex       -0.30%
453.povray        2.71%
456.hmmer         0.12%
458.sjeng         0.79%
462.libquantum*   1.47%
464.h264ref       2.67%
470.lbm*          0%
471.omnetpp       1.44%
473.astar*       -0.25%
482.sphinx3       1.30%
483.xalancbmk     0.52%

* The build workload is too small for the results to be significant; it is best to ignore these results.

I think this shows that this change doesn't significantly impact build times.

sbaranga closed this revision.Sep 1 2015, 4:27 AM

Committed in r246542.

Thanks,
Silviu