This is an archive of the discontinued LLVM Phabricator instance.

[AArch64/LoopUnrollRuntime] Don't avoid high-cost trip count computation on the AArch64
AbandonedPublic

Authored by flyingforyou on Dec 9 2015, 11:26 PM.

Details

Summary

On AArch64, the benefits of unrolling are more significant than the expense of a division to compute the trip count, so we are more able to tolerate that expense on average.
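To make the trade-off concrete, here is a sketch (in Python, not LLVM's actual transformation code) of why runtime unrolling can require a division: when the loop stride is a runtime value, the trip count itself must be computed as `(end - start + step - 1) / step` before the unroll factor can be applied. The function names and the unroll factor of 4 are illustrative choices, not anything from the patch.

```python
def sum_simple(start, end, step):
    """Reference loop: for (i = start; i < end; i += step) acc += i;
    Assumes a positive step."""
    acc = 0
    i = start
    while i < end:
        acc += i
        i += step
    return acc

def sum_unrolled(start, end, step, factor=4):
    """Runtime-unrolled version: one division to obtain the trip count,
    a remainder prologue, then the main body unrolled by `factor`."""
    if start >= end:
        return 0
    trip_count = (end - start + step - 1) // step   # the "expensive" division
    acc = 0
    i = start
    for _ in range(trip_count % factor):            # prologue: leftover iterations
        acc += i
        i += step
    for _ in range(trip_count // factor):           # unrolled main body
        acc += i; i += step
        acc += i; i += step
        acc += i; i += step
        acc += i; i += step
    return acc
```

The division is paid once per loop entry; the question the review debates is whether the unrolled body earns that cost back.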

Diff Detail

Event Timeline

flyingforyou retitled this revision from to [AArch64/LoopUnrollRuntime] Don't avoid high-cost trip count computation on the AArch64.
flyingforyou updated this object.
flyingforyou added a subscriber: llvm-commits.
zzheng added a subscriber: zzheng.
zzheng edited edge metadata. Dec 10 2015, 11:03 AM

Capital 'T' at the start of comment.

Do you have performance data showing the benefits?

flyingforyou edited edge metadata.

Addressed Zhaoshi's comment.

Thanks Zhaoshi.

I've just run a set of benchmarks, including the test-suite, on Juno (Cortex-A57); there were many improvements and some regressions.
The test-suite results show a 1.33% improvement and a 0.78% regression.
The composite benchmark value is computed using the geometric mean.
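For reference, the composite-score method described above can be sketched as the geometric mean of per-benchmark speedup ratios. The ratios below are made up for illustration, not the actual test-suite data.

```python
import math

def geomean(ratios):
    """Geometric mean of a list of positive ratios, computed via logs
    to avoid overflow on long benchmark lists."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical old_time/new_time ratios: >1.0 is a speedup, <1.0 a slowdown.
speedups = [1.23, 1.09, 0.92, 1.01, 0.98]
composite = geomean(speedups)
```

Unlike an arithmetic mean, the geomean treats a 2x speedup and a 2x slowdown as cancelling out, which is why it is the usual composite for benchmark suites.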

I actually found some regressions after merging r234846.
url: http://reviews.llvm.org/D8994

After this commit was merged, @hfinkel uploaded a new commit, r237947.

On X86 (and similar OOO cores) unrolling is very limited, and even if the runtime unrolling is otherwise profitable, the expense of a division to compute the trip count could greatly outweigh the benefits. On the A2, we unroll a lot, and the benefits of unrolling are more significant (seeing a 5x or 6x speedup is not uncommon), so we're more able to tolerate the expense, on average, of a division to compute the trip count.

I totally agree with this comment. Most AArch64 cores have a hardware divider, including for floating point, so I think we have more opportunity for unrolling.

I see. LGTM. You can merge it if no one objects.

anemet added a subscriber: anemet. Dec 10 2015, 11:17 PM

After this commit was merged, @hfinkel uploaded a new commit, r237947.

On X86 (and similar OOO cores) unrolling is very limited, and even if the runtime unrolling is otherwise profitable, the expense of a division to compute the trip count could greatly outweigh the benefits. On the A2, we unroll a lot, and the benefits of unrolling are more significant (seeing a 5x or 6x speedup is not uncommon), so we're more able to tolerate the expense, on average, of a division to compute the trip count.

I totally agree with this comment. Most AArch64 cores have a hardware divider, including for floating point, so I think we have more opportunity for unrolling.

Hmm, I don't know how hfinkel's comment supports your case. I can see how the trade-off is beneficial for his case of an in-order processor, but not for an out-of-order one. Did you run SPEC?

Thanks Adam.

I just want to say that on AArch64, the cost of division is not high enough to justify forgoing the unrolling opportunity,
so I just agreed with:

we unroll a lot, and the benefits of unrolling are more significant, so we're more able to tolerate the expense, on average, of a division to compute the trip count.

I just ran the SingleSource and MultiSource benchmarks. I am not sure whether I can run SPEC in the next few days.

I am curious about performance regression on Cyclone. Is this patch harmful for Cyclone or Twister?

-Junmo

Hi,

I just want to say that on AArch64, the cost of division is not high enough to justify forgoing the unrolling opportunity.

I'm not sure about this. If the loop is inside a loop nest and the inner loop trip count is low, the division could well become significant as on many cores it is not pipelined.
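A rough, back-of-envelope cost model illustrates this concern (the cycle counts below are invented for illustration, not real Cortex latencies): a non-pipelined division is paid once per entry into the inner loop, so it is amortized over only `trip_count` iterations.

```python
def net_cycles_saved(trip_count, div_latency, saved_per_iter):
    """Cycles saved by unrolling minus the one-off division cost,
    per entry into the inner loop."""
    return trip_count * saved_per_iter - div_latency

# With a hypothetical 20-cycle divide and 1 cycle saved per iteration,
# a trip count of 4 loses cycles overall, while a trip count of 100 wins.
low  = net_cycles_saved(trip_count=4,   div_latency=20, saved_per_iter=1)
high = net_cycles_saved(trip_count=100, div_latency=20, saved_per_iter=1)
```

If the inner loop of a nest is entered many times with a low trip count each time, the division cost dominates, which is the case James is worried about.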

To me, it seems that without any more context (i.e. if we don't have PGO information) it would be better to be more conservative.

James

mcrosier resigned from this revision. Dec 11 2015, 8:29 AM
mcrosier removed a reviewer: mcrosier.

FWIW, I very much agree with James.

Hi Junmo,

I tried out your patch on top of r254864, on a Juno board, running on Cortex-A57.
I see the following results:

Performance Regressions - Execution Time Δ
lnt.MultiSource/Benchmarks/Ptrdist/yacr2/yacr2
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.170=3
9.17%
lnt.SingleSource/Benchmarks/Shootout-C++/ackermann
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.264=3
8.02%
lnt.MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.149=3
4.78%
spec.cpu2006.ref.445_gobmk
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.176=3
1.84%
spec.cpu2006.ref.483_xalancbmk
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.94=3
1.75%
spec.cpu2006.ref.471_omnetpp
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.294=3
1.43%
spec.cpu2000.ref.253_perlbmk
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.337=3
1.22%
lnt.SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.135=3
1.10%

Performance Improvements - Execution Time Δ
lnt.MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.15=3
-23.07%
lnt.SingleSource/Benchmarks/Shootout/sieve
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.40=3
-9.50%
lnt.SingleSource/Benchmarks/BenchmarkGame/nsieve-bits
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.9=3
-7.26%
lnt.SingleSource/Benchmarks/BenchmarkGame/recursive
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.316=3
-3.42%
spec.cpu2006.ref.433_milc
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.235=3
-1.12%

While there are a few big jumps in the test-suite, I think the regressions show this is not uniformly an improvement for performance.

Thanks,

Kristof

Thanks James.

I have a question about your opinion.

I'm not sure about this. If the loop is inside a loop nest and the inner loop trip count is low, the division could well become significant as on many cores it is not pipelined.

I think your opinion assumes that loop unrolling has no positive effect. Even if the inner loop trip count is low, once the loop is unrolled there are more chances for other passes to optimize it.

Junmo.

Hi Kristof.

I really appreciate your experimental results.

In my experiment, there were many improvements and regressions, so I computed the composite value using the geometric mean.

Your experiment shows a 5.71% improvement and a 2.61% regression.

Junmo.

Hi Junmo,

I think it isn't possible to correctly compute the geomean without taking into account the execution times of all programs that are part of the suite.
I've done the computation of the effect of this patch on geomeans for the test-suite, spec2000 and spec2006, on Cortex-A53 and Cortex-A57, taking into account the execution time of all the programs in the suite.
I've used the multi-sampling infrastructure in LNT to run each program multiple times and took the median value of all program runs for a given program.

This gives me the following results - where numbers larger than 100% mean the patch gives a speedup, and numbers lower than 100% mean the patch gives a slow-down:

On Cortex-A57:
lnt/test-suite: 99.9%
spec2000: 100.3%
spec2006: 100.3%

On Cortex-A53:
lnt/test-suite: 99.8%
spec2000: 99.8%
spec2006: 99.9%

Furthermore, on a number of other commercial benchmark suites I also saw 0.2% to 0.6% regressions in performance on the overall benchmark score.
I think these measurements show that the patch overall results in a slight regression in performance.
Therefore, as is, I don't think the patch should be committed.
I'm wondering if there is any scope to make the AllowExpensiveTripCount heuristics smarter or more selective based on e.g. code characteristics?
Although the comments from James and Chad earlier indicate that that may need the compiler to guess very well how many iterations there are in inner loops, which is probably very hard without PGO information.

Thanks,

Kristof

Thanks for the detailed experimental results, Kristof.

Therefore, as is, I don't think the patch should be committed.
I'm wondering if there is any scope to make the AllowExpensiveTripCount heuristics smarter or more selective based on e.g. code characteristics?
Although the comments from James and Chad earlier indicate that that may need the compiler to guess very well how many iterations there are in inner loops, which is probably very hard without PGO information.

I agree with your opinion, so I will abandon this patch and upload a new one soon.

Junmo.

flyingforyou abandoned this revision. Dec 15 2015, 4:43 PM