This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Enable partial unrolling and runtime unrolling for AArch64 target
ClosedPublic

Authored by kevin.qin on Sep 2 2014, 3:09 AM.

Details

Reviewers
rengolin
echristo
Summary

This patch is to enable partial unrolling and runtime unrolling for AArch64 target. Applying this patch with the runtime unrolling prologue changes I just sent together, the SPEC2000 got improved by 0.6%, and for SPEC2006, the improved number is 0.8%. For code size, the images of two benchmarks got same 20% inflation. This experiment is done on A57.

Thanks,
Kevin

Diff Detail

Event Timeline

kevin.qin updated this revision to Diff 13157.Sep 2 2014, 3:09 AM
kevin.qin retitled this revision from to [AArch64] Enable partial unrolling and runtime unrolling for AArch64 target.
kevin.qin updated this object.
kevin.qin edited the test plan for this revision. (Show Details)
kevin.qin added a subscriber: Unknown Object (MLST).

Can you go into a bit more detail about the benchmarking you did here? What were the size differences? Do you see any notable changes on various benchmarks in particular? Etc.

Thanks!

Hi Eric,

I list the top 5 performance improvements and regressions along with the
code size changes shown in bracket on SPEC2000 & SPEC2006. They are:

SPEC2006

Top 5 performance improvements
403_gcc 95.86% (107.71%)
482_sphinx3 96.36% (139.90%)
433_milc 96.43% (120.59%)
456_hmmer 97.45% (140.49%)
470_lbm 97.76% (108.70%)

Top 5 regressions
400_perlbench 102.54% (108.20%)
401_bzip2 101.71% (126.37%)
445_gobmk 101.63% (103.01%)
458_sjeng 100.96% (110.19%)
450_soplex 100.67% (135.32%)

geomean changes 99.15% (119.76%)

SPEC2000

Top 5 improvements
179_art 92.89% (158.06%)
175_vpr 97.90% (123.87%)
183_equake 98.28% (112.50%)
197_parser 98.59% (115.23%)
254_gap 98.66% (128.01%)

Top 5 regressions
177_mesa 103.54% (151.85%)
253_perlbmk 101.15% (110.19%)
164_gzip 100.86% (128.13%)
256_bzip2 100.62% (135.00%)
255_vortex 100.35% (100.72%)

geomean changes 99.41% (122.34%)

Thanks,
Kevin

I can give more details on the performance regressions here. Basically, partial unrolling contributes small performance improvement, regressions and code size changes. Most of the regressions are caused by the runtime unrolling prologue. This prologue will do some extra work on checking the loop iterations and execute the reminder times of loop bodies. If the runtime unrolled loop is inside another loop, and inner loop count for each running is quite small, then there's a overhead happened in the prologue and caused the regession.

Maybe we can do some heuristic work to avoid unrolling such loops, but it's hard to decide which kind of heuristic logic should be used here, because most information can help to make this decision is coming from run time, which is difficult to be estimated at compling time. So maybe it need spend a lot of time to tune and the benefit is unknown.

In other hand, the runtime unrolling algorithm is already there and proved to give performance improvments for SPEC2000 and SPEC2006. So I suggest we can enable it first, and then try to get it even better in future.

Regards,
Kevin

hfinkel added a subscriber: hfinkel.Sep 9 2014, 2:51 PM
hfinkel added inline comments.
lib/Target/AArch64/AArch64TargetTransformInfo.cpp
531

I recommend not enabling it this way. Instead, you should set LoopMicroOpBufferSize in the relevant scheduling model. You'll see this is done in lib/Target/X86/X86SchedHaswell.td, for example. This was you can take advantage of the default logic in BasicTargetTransformInfo.cpp (which does things like exclude loops with function calls).

Also, it will force you to pick a threshold, which is really a core-specific property (generally speaking). In my experience, this helps on OOO cores only for small loops. For in-order cores, especially if you use AA during instruction scheduling, it can help for larger loops too (but obviously not *too* large).

kevin.qin updated this revision to Diff 13921.Sep 22 2014, 3:15 AM

Hi,

I proposed to set 20 as loop buffer size for A57 after a lot of benchmarking experiments. Then this number will be used as a threshold to allow partial & runtime unrolling on small loops. The effect on spec2006 is,

Benchmark | Performance Improvement | Code Size Increment
400_perlbench -1.20% 1.22%
401_bzip2 4.59% 0.00%
403_gcc 0.86% 0.36%
433_milc -1.28% 2.71%
444_namd -0.02% 1.32%
445_gobmk 0.38% 0.13%
447_dealII 0.29% 2.48%
450_soplex -1.02% 3.57%
453_povray 2.80% 1.05%
456_hmmer 2.32% 3.59%
458_sjeng 0.04% 0.12%
462_libquantum -0.11% 0.59%
464_h264ref -0.03% 2.98%
470_lbm 0.02% 0.20%
471_omnetpp 2.67% 0.00%
473_astar -1.38% 5.93%
482_sphinx3 -1.38% 3.98%
483_xalancbmk -0.98% 0.71%
GEOMEAN 0.38% 1.70%

I also did experiments on SPEC2000, but there's no significant performance impact. The geomean code size increment on SPEC2000 is 1.76%, and the increased number on all benchmarks are below 5% except 179.art, which is 14%. Fortunately, it could get partly fixed with my loop prologue simplification patch, and got 9.45% code bloat on 179.art.

Overall, this patch can bring 0.4% performance improvement on spec2006 with about 1.8% code size increment, and all code bloats are under 10% after applying another loop prologue simplification patch. So I think it's acceptable for all optimization levels except -Os.

Thanks,
Kevin

Looks like the chip itself has a 32 entry loop buffer (with 2 forward and one backward branch support). What range of values did you check here? (i.e. why is 32 not the right value to put here like with the other port specific constants?)

By the comments, looks like this is just the value that ended up not bloating too much but still giving some performance. I think this is a lot more acceptable than before, though I'm not sure about when this will run. If this ends up only affecting -O2/3 (maybe even O1), it should be fine. But O0, Os and Oz need to skip anything depending on this.

Do we care that the relationship between the two (opt level and buffer size) are undocumented and de-coupled?

kevin.qin updated this revision to Diff 13982.Sep 23 2014, 4:08 AM

Hi Renato,

Thanks for your review.

This patch setted PartialOptSizeThreshold to 0 to disable partial & runtime unrolling on -Os/-Oz. This is implemented in target transform info which makes it effects on aarch64 only. So for A57, the loop buffer size is always 20 at any optimization level, only the way how to use it at different optimization levels makes different.

Cheers,
Kevin

hfinkel added inline comments.Sep 23 2014, 5:45 AM
lib/Target/AArch64/AArch64TargetTransformInfo.cpp
531

Add a space in between the // and the D.

rengolin accepted this revision.Sep 24 2014, 8:38 AM
rengolin added a reviewer: rengolin.

Thanks Kevin. I'm ok with the resulting benchmark numbers. Looking forward for your fix of the remaining 10+% code bloat.

cheers,
--renato

This revision is now accepted and ready to land.Sep 24 2014, 8:38 AM
  • Original Message -----

From: "Renato Golin" <renato.golin@linaro.org>
To: kevinqindev@gmail.com
Cc: hfinkel@anl.gov, echristo@gmail.com, "amara emerson" <amara.emerson@arm.com>, mcrosier@codeaurora.org, "moritz
roth" <moritz.roth@arm.com>, llvm-commits@cs.uiuc.edu
Sent: Wednesday, September 24, 2014 10:38:59 AM
Subject: Re: [PATCH] [AArch64] Enable partial unrolling and runtime unrolling for AArch64 target

Thanks Kevin. I'm ok with the resulting benchmark numbers. Looking
forward for your fix of the remaining 10+% code bloat.

As I recall, this is related to http://reviews.llvm.org/D5147 -- this is on my TODO list to review again, but feel free to help ;)

Thanks again,
Hal

cheers,
--renato

http://reviews.llvm.org/D5148

kevin.qin updated this revision to Diff 14441.Oct 6 2014, 3:09 AM
kevin.qin edited edge metadata.

Hi,

After some performance investigation on smaller granular, I proposed to set 16 as loop buffer size for A57. I will post some experiments result in following comments. Thanks Chandler and Jiangning for their advices.

Cheers,
Kevin

kevin.qin updated this revision to Diff 14568.Oct 8 2014, 6:01 AM

Hi Eric,

The comment in patch is updated.

Thanks,
Kevin

echristo accepted this revision.Oct 8 2014, 12:52 PM
echristo added a reviewer: echristo.

Sure. Seems reasonable.

One comment I noticed below.

lib/Target/AArch64/AArch64TargetTransformInfo.cpp
560

Space here :)

Committed at r219401.

Cheers,
Kevin

2014-10-08 20:52 GMT+01:00 Eric Christopher <echristo@gmail.com>:

Sure. Seems reasonable.

One comment I noticed below.

Comment at: lib/Target/AArch64/AArch64TargetTransformInfo.cpp:554
@@ +553,3 @@
+ UnrollingPreferences &UP) const {
+ //Disable partial & runtime unrolling on -Os.

+ UP.PartialOptSizeThreshold = 0;

Space here :)

http://reviews.llvm.org/D5148

rengolin closed this revision.Oct 22 2014, 7:04 AM