This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Turn on by default interleaved access vectorization
ClosedPublic

Authored by sbaranga on Aug 19 2015, 6:24 AM.

Details

Summary

This change turns on interleaved access vectorization by default
for AArch64.

We also clean up some tests which were specifically enabling this
behaviour.
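For readers unfamiliar with the feature, here is a minimal sketch (not taken from the patch; the function and names are illustrative) of the kind of loop this transformation targets: stride-2 memory accesses that the loop vectorizer can group into an interleaved access, which the AArch64 backend can then lower to a structured load such as ld2 instead of separate loads and shuffles.

```c
#include <stddef.h>

/* De-interleave a stereo sample buffer into separate channels.
   The loads in[2*i] and in[2*i+1] form an interleave group with
   factor 2; with interleaved access vectorization enabled, the
   vectorizer emits one wide load plus shuffles, which the AArch64
   backend can lower to an ld2 structured load. */
void deinterleave(const short *in, short *left, short *right, size_t n) {
    for (size_t i = 0; i < n; i++) {
        left[i]  = in[2 * i];
        right[i] = in[2 * i + 1];
    }
}
```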

Diff Detail

Event Timeline

sbaranga updated this revision to Diff 32543.Aug 19 2015, 6:24 AM
sbaranga retitled this revision from to [AArch64] Turn on by default interleaved access vectorization.
sbaranga updated this object.
sbaranga added a subscriber: llvm-commits.

Tested with lnt, spec2000, and some other internal benchmarks (same as on ARM).

Performance Regressions - Execution Time
lnt.MultiSource/Applications/hexxagon/hexxagon 5.95%
lnt.MultiSource/Benchmarks/Olden/bh/bh 5.07%
lnt.SingleSource/Benchmarks/Shootout-C++/lists 3.02%
lnt.MultiSource/Applications/sqlite3/sqlite3 2.18%
lnt.MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm 1.59%
lnt.MultiSource/Benchmarks/MiBench/telecomm-CRC32/telecomm-CRC32 1.54%
lnt.MultiSource/Benchmarks/BitBench/five11/five11 1.20%
lnt.MultiSource/Benchmarks/TSVC/Expansion-dbl/Expansion-dbl 1.19%

Performance Improvements - Execution Time
lnt.MultiSource/Benchmarks/PAQ8p/paq8p -16.27%
lnt.MultiSource/Benchmarks/VersaBench/bmm/bmm -3.78%
lnt.MultiSource/Benchmarks/BitBench/uudecode/uudecode -3.55%
lnt.SingleSource/UnitTests/Vectorizer/gcc-loops -3.31%
lnt.MultiSource/Benchmarks/ASC_Sequoia/IRSmk/IRSmk -2.30%
lnt.MultiSource/Benchmarks/Olden/perimeter/perimeter -2.07%
lnt.SingleSource/Benchmarks/Polybench/medley/floyd-warshall/floyd-warshall -1.68%
lnt.SingleSource/Benchmarks/BenchmarkGame/puzzle -1.20%

Again, no major change in lnt, and the spec scores seem unaffected. As on ARM, I've seen improvements in other benchmarks.

rengolin accepted this revision.Aug 19 2015, 7:12 AM
rengolin added a reviewer: rengolin.

LNT is not famous for being accurate. :) And as far as I know, it's not exercising strided access that much, if at all.

LGTM. Thanks!

This revision is now accepted and ready to land.Aug 19 2015, 7:12 AM

I’m not sure about this LG and have a number of questions:

  1. Has the review of the actual interleave code been finished?
  2. What is the compile-time impact?
  3. Could you share detailed performance data on SPEC ref input (per benchmark) and perhaps some other suites you run on a regular basis?
  4. Could you count the number of times interleaved vectorization triggers per benchmark and see if there is a correlation to the run-time data you measure?
  5. Do you expect an impact on other architectures (not just ARM, ARM64, etc.)? Data?

Thanks
Gerolf

I’m not sure about this LG and have a number of questions:

  1. Has the review of the actual interleave code been finished?

Yes, all stride / interleaved access for ARM and AArch64 have been reviewed and committed.

  2. What is the compile-time impact?

AFAIK unnoticeable. The validation phase bails out pretty quickly when strides are not possible, just like everything else.

  3. Could you share detailed performance data on SPEC ref input (per benchmark) and perhaps some other suites you run on a regular basis?

That's easier said than done. SPEC and other benchmark licences are silly in that you never know how much sharing is too much, until you pass that threshold.

But one thing is for sure, no one shares "detailed performance data". Ever.

In this specific case, Silviu hasn't shared any SPEC results simply because they have not changed with any statistical significance, and that's thoroughly expected, since there aren't many cases of stride vectorization opportunities in SPEC. There are, however, such opportunities in other benchmarks, which they did run, and in which they have seen improvements (sorry, I can't say more than that).

It is in the interest of ARM to do as much benchmarking as possible and to be *very* accurate and responsible about it, including compile time, so I trust their investigation quality. That's why it looks good to me.

  4. Could you count the number of times interleaved vectorization triggers per benchmark and see if there is a correlation to the run-time data you measure?

LNT has some, SPEC has close to none, others rely heavily on it. To be honest, the numbers are pretty much what I expected.

  5. Do you expect an impact on other architectures (not just ARM, ARM64, etc.)? Data?

This is just enabled for ARM and AArch64, so no other architecture will ever see this happening. It's up to other people to enable it and customise to their architecture, and certainly not for this patch.

Keep in mind that what Silviu is enabling here is a development version of the stride vectorizer, so we can start tracking performance and fixing the corner cases. Release 3.7 is already branched and release 3.8 is a looong way away, so we'll have plenty of time to fix any issues that come up on ARM and AArch64.

All the other work, including experimental testing of the feature (by turning on stride vectorization with a flag), has been done for weeks now, and all looks well. So it's only natural to move from the experimental to the development stage, and to keep a good number of months between the development and production stages, when 3.8 branches out.

In the unlikely event that stride vectorization causes enough trouble that we can't fix it for 3.8, we'll disable it again by default and release a stable product, but trunk will have it on by default so that people on all sides can find problems with it.

I hope that sheds some light on your doubts.

cheers,
--renato

In addition to Renato's reply:

I’m not sure about this LG and have a number of questions:

  1. Has the review of the actual interleave code been finished?
  2. What is the compile-time impact?
  3. Could you share detailed performance data on SPEC ref input (per benchmark) and perhaps some other suites you run on a regular basis?
  4. Could you count the number of times interleaved vectorization triggers per benchmark and see if there is a correlation to the run-time data you measure?

I'm taking a closer look at SPEC now, but I doubt there will be a strong correlation with run-time data (only if we get lucky and optimize a hot loop). The changes weren't significant either.

  5. Do you expect an impact on other architectures (not just ARM, ARM64, etc.)? Data?

I suspect it wouldn't be beneficial unless the architecture's backend has a way of efficiently lowering the loads + shuffles to a reasonably fast instruction sequence (and this should also be reflected in the cost model). I had to make a number of fixes for ARM/AArch64 to remove the regressions I found, so I wouldn't turn this on elsewhere without data.

Thanks
Gerolf

-Silviu

I suspect it wouldn't be beneficial unless the architecture's backend has a way of efficiently lowering the loads + shuffles to a reasonably fast instruction sequence (and this should also be reflected in the cost model). I had to make a number of fixes for ARM/AArch64 to remove the regressions I found, so I wouldn't turn this on elsewhere without data.

I believe Intel's AVX512 has interleaved access that can be used to profit from strided vectorization, but that's up to the Intel folks to implement, test and benchmark.

Here are the spec2k and spec2k6 results (AArch64, Cortex-A57). There seems to be no significant change. This is probably a combination of workload types and the optimized functions not being 'hot'. The preferred workload here seems to be something like image-processing kernels (which explains why the optimization triggered a lot in the mesa benchmark).
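As a sketch of that workload shape (a hypothetical kernel, not taken from mesa or the test suite; the grayscale weights are assumed for illustration): an image-processing loop over interleaved RGB pixels, whose stride-3 loads form a factor-3 interleave group of the kind AArch64 can lower to an ld3 structured load.

```c
#include <stddef.h>

/* Convert an interleaved RGB8 buffer to grayscale using integer
   weights (assumed coefficients, for illustration only; they sum
   to 256).  The three stride-3 loads per iteration form a factor-3
   interleave group, the pattern this optimization targets. */
void rgb_to_gray(const unsigned char *rgb, unsigned char *gray, size_t npix) {
    for (size_t i = 0; i < npix; i++) {
        unsigned r = rgb[3 * i];
        unsigned g = rgb[3 * i + 1];
        unsigned b = rgb[3 * i + 2];
        gray[i] = (unsigned char)((77 * r + 150 * g + 29 * b) >> 8);
    }
}
```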

SPEC2000

Size:

Name      Change (patched/original - 1)   Binary changed
gzip       0.07%                          Y
vpr        0.08%                          Y
gcc        0.21%                          Y
mesa       0.69%                          Y
art       -0.04%                          Y
mcf        0                              N
equake     0                              N
crafty     0                              Y
ammp       0                              Y
parser     0                              N
eon        0.10%                          Y
perlbmk    0                              Y
gap        0                              N
vortex     0                              N
bzip2      0                              N
twolf      0.01%                          Y

Performance (only results from changed binaries are included).
Negative numbers are improvements; positive numbers are regressions.

Name                           Execution time (patched/original - 1)
spec.cpu2000.ref.164_gzip      -0.25%
spec.cpu2000.ref.175_vpr       -0.55%
spec.cpu2000.ref.176_gcc       -1.25%
spec.cpu2000.ref.177_mesa       0.40%
spec.cpu2000.ref.179_art       -1.04%
spec.cpu2000.ref.186_crafty     0.20%
spec.cpu2000.ref.188_ammp       0.63%
spec.cpu2000.ref.252_eon       -0.48%
spec.cpu2000.ref.253_perlbmk    0.86%
spec.cpu2000.ref.300_twolf     -1.18%

Identified interleaved accesses in loops:

Name      Vectorized with IA   Vectorizable with IA and not profitable
gzip       3                   0
vpr        1                   3
gcc        9                   0
mesa      39                   6
art       11                   2
crafty     5                   0
ammp       1                   6
eon        1                   0
perlbmk    3                   1
twolf      1                   0

SPEC2006

Size:

Name         Change (patched/original - 1)   Vectorized with IA   Vectorizable with IA but not profitable   Binary changed
perlbench     0                                4                    2                                       Y
bzip2         0                                0                    0                                       N
gcc           0                                5                    3                                       Y
mcf           0                                0                    0                                       N
milc          0                                0                    0                                       N
namd          0                               17                    0                                       Y
gobmk         0                                3                    6                                       Y
dealII        0.67%                          232                   65                                       Y
soplex        0                                2                   19                                       Y
povray        0                               19                   15                                       Y
hmmer        -0.01%                            1                    0                                       Y
sjeng         0                                0                    0                                       N
libquantum    0.20%                            1                    2                                       Y
h264ref       0.07%                            3                   10                                       Y
lbm           0                                0                    0                                       N
omnetpp       0                                0                    0                                       N
astar         0                                0                    0                                       N
sphinx3       1.84%                            8                    1                                       Y
xalancbmk     0                                0                    3                                       N

The large number of optimized loops in dealII comes from an STL function getting optimized (the same function essentially gets optimized multiple times).

Performance (only results from changed binaries are included).

Negative numbers are improvements; positive numbers are regressions.

Name                               Patched/Original - 1
spec.cpu2006.ref.400_perlbench      1.49%
spec.cpu2006.ref.403_gcc            0.06%
spec.cpu2006.ref.444_namd           0.23%
spec.cpu2006.ref.445_gobmk          0.03%
spec.cpu2006.ref.447_dealII        -0.57%
spec.cpu2006.ref.450_soplex         0.22%
spec.cpu2006.ref.453_povray        -1.15%
spec.cpu2006.ref.456_hmmer          0.31%
spec.cpu2006.ref.462_libquantum     0.05%
spec.cpu2006.ref.464_h264ref        0.39%
spec.cpu2006.ref.482_sphinx3        1.74%

The sphinx3 result seems to be noise (it went away with further runs).

I'll post some compile-time results later on (probably using a bootstrapped LLVM build).

Thanks,
Silviu

Hi,

I have performed a bootstrap AArch64 build of clang to measure the compile-time impact. The build was done with -j2. The measurement showed a 0.29% higher build time when enabling interleaved access vectorization.

I'll get more data points, but this looks mostly like noise to me; turning this on doesn't appear to have a significant impact on build times.

Given this and the SPEC analysis above, does anyone have any objections to turning this on by default for both ARM and AArch64?

Thanks,
Silviu

No objections, Silviu. Please commit.

Hi Gerolf,

You had some objections to this before. Do you think everything is ok with the latest data?

Thanks,
Silviu

Here are the results for build-time changes (with -j1) per benchmark for spec2k/2k6. The build times for spec2k were much more stable than the ones for spec2k6. The build-time improvements are probably false positives.

SPEC2000:

Name          Change (patch/original - 1)
164.gzip*     -1.01%
175.vpr        0.44%
176.gcc        0.16%
181.mcf*      -2.63%
186.crafty     0.63%
197.parser    -0.22%
252.eon       -0.10%
253.perlbmk   -0.06%
254.gap        0.05%
255.vortex    -0.23%
256.bzip2*    -0.70%
300.twolf     -0.23%
177.mesa       1.29%
179.art*       0.62%
183.equake*    0%
188.ammp       1.04%

SPEC2006:

Name             Change (patch/original - 1)
400.perlbench    -2.02%
401.bzip2        -3.93%
403.gcc          -2.80%
429.mcf*         -7.38%
433.milc         -5.24%
444.namd         -3.70%
445.gobmk        -1.87%
447.dealII       -3.17%
450.soplex       -0.30%
453.povray        2.71%
456.hmmer         0.12%
458.sjeng         0.79%
462.libquantum*   1.47%
464.h264ref       2.67%
470.lbm*          0%
471.omnetpp       1.44%
473.astar*       -0.25%
482.sphinx3       1.30%
483.xalancbmk     0.52%

* The build workload is too small for the results to be significant; it is best to ignore these results.

I think this shows that this change doesn't significantly impact build times.

sbaranga closed this revision.Sep 1 2015, 4:27 AM

Committed in r246542.

Thanks,
Silviu