Differential D12149: [AArch64] Turn on by default interleaved access vectorization
Authored by sbaranga on Aug 19 2015, 6:24 AM.

Details
This change turns on interleaved access vectorization by default for AArch64.
We also clean up some tests which were specifically enabling this behaviour.
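For context, a minimal sketch of the kind of loop this feature targets (hypothetical, not taken from the patch or its tests): the two loads below access memory with stride 2, so the loop vectorizer can now turn them into a single wide vector load plus shuffles, which the AArch64 backend can lower to an ld2 instruction.

    /* Hypothetical stride-2 loop: a[] holds interleaved pairs (e.g. complex
       re/im values). The two strided loads form one interleave group and can
       become a single wide load plus shuffles (ld2 on AArch64). */
    void sum_pairs(const float *a, float *re_sum, float *im_sum, int n) {
      float re = 0.0f, im = 0.0f;
      for (int i = 0; i < n; ++i) {
        re += a[2 * i];     /* element 0 of each pair */
        im += a[2 * i + 1]; /* element 1 of each pair */
      }
      *re_sum = re;
      *im_sum = im;
    }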
Tested with lnt, spec2000 and some other internal benchmarks (same as on ARM).

Performance Regressions - Execution Time
Performance Improvements - Execution Time

Again, no major change in lnt, and spec scores seem unaffected. Same as on ARM, I've seen improvements in other benchmarks.

LNT is not famous for being accurate. :) And as far as I know, it's not exercising strided access that much, if at all. LGTM. Thanks!

I'm not sure about this LG and have a number of questions:
Thanks

Yes, all stride / interleaved access work for ARM and AArch64 has been reviewed and committed.
AFAIK unnoticeable. The validation phase drops out pretty quickly when strides are not possible, just like everything else.
That's easier said than done. SPEC and other benchmark licenses are silly in that you never know how much sharing is too much, until you pass that threshold. But one thing is for sure: no one shares "detailed performance data". Ever. In this specific case, Silviu hasn't shared any SPEC results simply because they have not changed with any statistical significance, and that's thoroughly expected, since there aren't many cases of stride vectorization opportunities in SPEC. There are, however, in other benchmarks, which they did run, and in which they have seen improvements. (Sorry, I can't say more than that.)

It is in the interest of ARM to do as much benchmarking as possible and to be *very* accurate and responsible about it, including compile time, so I trust their investigation quality. That's why it looks good to me.
LNT has some, SPEC has close to none, and others rely heavily on it. To be honest, the numbers are pretty much what I expected.
This is just enabled for ARM and AArch64, so no other architecture will ever see this happening. It's up to other people to enable it and customise it for their architecture, and certainly not in this patch. Keep in mind that what Silviu is enabling here is a development version of the stride vectorizer, so we can start tracking performance and fixing the corner cases. Release 3.7 is already branched and release 3.8 is a looong way away, so we'll have plenty of time to fix any issues that come up on ARM and AArch64.

All the other work, including experimental testing of the feature (by turning on stride vectorization with a flag), has been done for weeks now, and all looks well. So it's only natural to move from the experimental to the development stage, and keep a good number of months between the development and production stages, when 3.8 branches out. In the unlikely event that stride vectorization causes enough trouble that we can't fix it for 3.8, we'll disable it again by default and release a stable product, but trunk will have it on by default so that people on all sides can find problems with it.

I hope that sheds some light on your doubts.

cheers,

In addition to Renato's reply: I'm taking a closer look at SPEC now, but I doubt there will be a strong correlation with run-time data (only if we get lucky and optimize a hot loop). The changes weren't significant either.
I suspect it wouldn't be beneficial unless the architecture's backend has a way of efficiently lowering the load + shuffles to a reasonably fast instruction sequence (and this should also be reflected in the cost model). I had to do a number of fixes for ARM/AArch64 to remove the regressions I found, so I wouldn't turn this on elsewhere without data.
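As an illustration of why the lowering matters (a hypothetical sketch, not code from the patch): the loop below both loads and stores interleaved pairs. Vectorizing it produces wide loads/stores plus shuffles; on AArch64 those can be matched to ld2/st2, while a target without equivalent instructions would execute the shuffles explicitly, which is exactly what the cost model has to capture.

    /* Hypothetical interleaved load + store: scale the real parts and
       negate-scale the imaginary parts of interleaved (re, im) pairs.
       The stride-2 accesses form interleave groups that AArch64 can
       lower to ld2/st2. */
    void scale_conjugate(float *a, float k, int n) {
      for (int i = 0; i < n; ++i) {
        a[2 * i]     =  k * a[2 * i];     /* real part */
        a[2 * i + 1] = -k * a[2 * i + 1]; /* imaginary part */
      }
    }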
-Silviu

I believe Intel's AVX512 has interleaved access that can be used to profit from strided vectorization, but that's up to the Intel folks to implement, test and benchmark.

Here are the spec2k and spec2k6 results (AArch64, Cortex-A57). There seems to be no significant change. This is probably a combination of the workload types and the optimized functions not being 'hot'. The preferred workload here seems to be something like image-processing kernels (which explains why the optimization triggered a lot in the mesa benchmark; a sketch of that kind of kernel follows the results below).

SPEC2000 Size:
Performance (only includes results from changed binaries)
Identified interleaved accesses in loops:
SPEC2006 Size:
The large number of optimized loops in dealII comes from an STL function getting optimized (the same function essentially gets optimized multiple times).

Performance (only includes results from changed binaries)
Negative numbers are improvements, positive numbers are regressions.
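To illustrate the image-processing-kernel shape mentioned above (a hypothetical example, not taken from mesa or the benchmarks): the three stride-3 loads below form a single interleave group, which the AArch64 backend can lower to an ld3 instruction.

    /* Hypothetical image-processing style kernel: packed RGB pixels to
       grayscale. The three stride-3 loads form one interleave group,
       which AArch64 can lower to an ld3 instruction. */
    void rgb_to_gray(const unsigned char *rgb, unsigned char *gray, int npixels) {
      for (int i = 0; i < npixels; ++i) {
        unsigned r = rgb[3 * i];
        unsigned g = rgb[3 * i + 1];
        unsigned b = rgb[3 * i + 2];
        gray[i] = (unsigned char)((77 * r + 150 * g + 29 * b) >> 8);
      }
    }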
The sphinx3 result seems to be run-to-run variation (it went away with further runs). I'll post some compile-time results later on (probably using a bootstrap llvm build).

Thanks,

Hi,

I have performed a bootstrap aarch64 build of clang to measure the compile-time impact. The build was done with -j2. The measurement showed a 0.29% higher build time when enabling interleaved access vectorization. I'll get more data points, but this looks mostly like noise to me, and it looks like turning this on doesn't have a significant impact on build times. Given this and the spec analysis above, does anyone have any objections to turning this on by default for both arm and aarch64?

Thanks,

Hi Gerolf,

You had some objections to this before. Do you think everything is ok with the latest data?

Thanks,
Here are the results for build-time changes (with -j1) per-benchmark for spec2k/2k6. The build times for spec2k were much more stable than the ones for spec2k6. The build-time improvements are probably false positives.

SPEC2000:
SPEC2006:
* The build workload is too small for the results to be significant. The best thing to do is to ignore these results.

I think this shows that this change doesn't significantly impact build times.