This is an archive of the discontinued LLVM Phabricator instance.

[AArch64/LoopUnrollRuntime] Don't avoid high-cost trip count computation on the AArch64
Abandoned, Public

Authored by flyingforyou on Dec 9 2015, 11:26 PM.


Details

Reviewers

t.p.northover
zzheng
hfinkel

Summary

On AArch64, the benefits of unrolling are more significant than the expense of a division to compute the trip count, so on average we are better able to tolerate that expense.

Diff Detail

Event Timeline

flyingforyou updated this revision to Diff 42390. Dec 9 2015, 11:26 PM

flyingforyou retitled this revision from to [AArch64/LoopUnrollRuntime] Don't avoid high-cost trip count computation on the AArch64.

flyingforyou updated this object.

flyingforyou added reviewers: mcrosier, t.p.northover, hfinkel.

flyingforyou added a subscriber: llvm-commits.

Herald added subscribers: rengolin, aemerson. Dec 9 2015, 11:26 PM

zzheng added a reviewer: zzheng. Dec 10 2015, 11:00 AM

zzheng added a subscriber: zzheng.

Capital 'T' at the start of comment.

Do you have performance data showing the benefits?

Addressed Zhaoshi's comment.

Thanks Zhaoshi.

I've just run a set of benchmarks, including the test-suite, on Juno (Cortex-A57). There were many improvements and some regressions.
The test-suite results show a 1.33% improvement and a 0.78% regression.
The geometric mean is used to compute the composite benchmark result.

Actually, I found some regressions after r234846 was merged.
url: http://reviews.llvm.org/D8994

After that commit was merged, @hfinkel uploaded a new commit, r237947.

On X86 (and similar OOO cores) unrolling is very limited, and even if the runtime unrolling is otherwise profitable, the expense of a division to compute the trip count could greatly outweigh the benefits. On the A2, we unroll a lot, and the benefits of unrolling are more significant (seeing a 5x or 6x speedup is not uncommon), so we're more able to tolerate the expense, on average, of a division to compute the trip count.

I totally agree with this comment. Most AArch64 cores have a hardware divider, including for floating point, so I think we can take more unrolling opportunities.

I see. LGTM. You can merge it if no one objects.

After that commit was merged, @hfinkel uploaded a new commit, r237947.

On X86 (and similar OOO cores) unrolling is very limited, and even if the runtime unrolling is otherwise profitable, the expense of a division to compute the trip count could greatly outweigh the benefits. On the A2, we unroll a lot, and the benefits of unrolling are more significant (seeing a 5x or 6x speedup is not uncommon), so we're more able to tolerate the expense, on average, of a division to compute the trip count.

I totally agree with this comment. Most AArch64 cores have a hardware divider, including for floating point, so I think we can take more unrolling opportunities.

Hmm, I don't know how hfinkel's comment supports your case. I can see how the trade-off is beneficial for his case of an in-order processor, but not for an out-of-order one. Did you run SPEC?

Thanks Adam.

I just want to say that on AArch64 the division cost is not high enough to justify giving up the unrolling opportunity,
so I just agreed with

we unroll a lot, and the benefits of unrolling are more significant, so we're more able to tolerate the expense, on average, of a division to compute the trip count.

I just ran the Single & MultiSource Benchmarks. I am not sure whether I can run SPEC in the next few days.

I am curious about performance regression on Cyclone. Is this patch harmful for Cyclone or Twister?

-Junmo

Hi,

I just want to say that on AArch64 the division cost is not high enough to justify giving up the unrolling opportunity.

I'm not sure about this. If the loop is inside a loop nest and the inner loop trip count is low, the division could well become significant as on many cores it is not pipelined.

To me, it seems that without any more context (i.e. if we don't have PGO information) it would be better to be more conservative.

James
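To make the cost James describes concrete, here is a minimal hand-written sketch of what runtime unrolling by 4 amounts to when the stride is only known at run time. The function names `sumStrided` and `sumStridedUnrolled` are hypothetical and this is not the code LLVM emits; it only illustrates that the trip count needs a division before the unrolled body can start. It assumes `step > 0` and that `end - begin + step - 1` does not overflow.

```cpp
#include <cstdint>
#include <vector>

// Reference loop: the stride `step` is a run-time value (assumed > 0).
std::uint64_t sumStrided(const std::vector<std::uint32_t> &data,
                         std::uint64_t begin, std::uint64_t end,
                         std::uint64_t step) {
  std::uint64_t sum = 0;
  for (std::uint64_t i = begin; i < end; i += step)
    sum += data[i];
  return sum;
}

// Illustrative hand-unrolled equivalent (unroll factor 4).
std::uint64_t sumStridedUnrolled(const std::vector<std::uint32_t> &data,
                                 std::uint64_t begin, std::uint64_t end,
                                 std::uint64_t step) {
  if (begin >= end)
    return 0;

  // The "expensive" part: a division by a run-time value, executed once
  // per entry into the loop, before any useful work is done.
  const std::uint64_t tripCount = (end - begin + step - 1) / step;

  std::uint64_t i = begin;
  std::uint64_t sum = 0;

  // Remainder iterations first; tripCount % 4 is cheap (power-of-two factor).
  for (std::uint64_t r = tripCount % 4; r != 0; --r, i += step)
    sum += data[i];

  // Unrolled body: four original iterations per pass.
  for (std::uint64_t n = tripCount / 4; n != 0; --n, i += 4 * step) {
    sum += data[i];
    sum += data[i + step];
    sum += data[i + 2 * step];
    sum += data[i + 3 * step];
  }
  return sum;
}
```

If such a loop sits inside an outer loop and runs only a few iterations per entry, the division is paid on every entry while the unrolled body may execute once or not at all, which is the case the comment above warns about.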

FWIW, I very much agree with James.

Hi Junmo,

I tried out your patch on top of r254864, on a juno board, running on
Cortex-A57.
I see the following results:

Performance Improvements - Execution Time Δ
lnt.MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.15=3
-23.07%
lnt.SingleSource/Benchmarks/Shootout/sieve
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.40=3
-9.50%
lnt.SingleSource/Benchmarks/BenchmarkGame/nsieve-bits
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.9=3
-7.26%
lnt.SingleSource/Benchmarks/BenchmarkGame/recursive
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.316=3
-3.42%
spec.cpu2006.ref.433_milc
http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.235=3
-1.12%

While there are a few big jumps in the test-suite, I think the regressions show this is not uniformly an improvement for performance.

Thanks,

Kristof

Thanks James.

I have a question about your opinion.

I'm not sure about this. If the loop is inside a loop nest and the inner loop trip count is low, the division could well become significant as on many cores it is not pipelined.

I think your opinion assumes that loop unrolling has no positive effect. I think that even if the inner loop trip count is low, an unrolled loop gives other passes more chances to optimize it.

Junmo.

Hi Kristof.

I really appreciate your experimental results.

In my experiment, there were many improvements and regressions, so I computed the result using the geometric mean.

Your experiment shows a 5.71% improvement and a 2.61% regression.

Junmo.

Hi Junmo,

I think it isn't possible to correctly compute the geomean without taking into account the execution times of all programs that are part of the suite.
I've done the computation of the effect of this patch on geomeans for the test-suite, spec2000 and spec2006, on Cortex-A53 and Cortex-A57, taking into account the execution time of all the programs in the suite.
I've used the multi-sampling infrastructure in LNT to run each program multiple times and took the median value of all program runs for a given program.

This gives me the following results - where numbers larger than 100% mean the patch gives a speedup, and numbers lower than 100% mean the patch gives a slow-down:

On Cortex-A57:
lnt/test-suite: 99.9%
spec2000: 100.3%
spec2006: 100.3%

On Cortex-A53:
lnt/test-suite: 99.8%
spec2000: 99.8%
spec2006: 99.9%

Furthermore, on a number of other commercial benchmark suites I also saw 0.2% to 0.6% regressions in performance on the overall benchmark score.
I think these measurements show that the patch overall results in a slight regression in performance.
Therefore, as is, I don't think the patch should be committed.
I'm wondering if there is any scope to make the AllowExpensiveTripCount heuristics smarter or more selective based on e.g. code characteristics?
Although the comments from James and Chad earlier indicate that that may need the compiler to guess very well how many iterations there are in inner loops, which is probably very hard without PGO information.

Thanks,

Kristof
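Kristof's point about geomeans can be sketched as follows. This is a minimal illustration, not LNT's implementation: the helper names `median`, `geomean`, and `suiteSpeedup` are hypothetical, and each program is assumed to have at least one timing sample per configuration. It takes the median of repeated runs per program, forms the ratio baseline/patched (so values above 1.0 mean a speedup, matching the convention used above), and takes the geometric mean over every program in the suite, including the unchanged ones.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a non-empty set of timing samples (copy is sorted locally).
double median(std::vector<double> samples) {
  std::sort(samples.begin(), samples.end());
  const std::size_t n = samples.size();
  return n % 2 ? samples[n / 2]
               : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

// Geometric mean of positive ratios, computed in log space.
double geomean(const std::vector<double> &ratios) {
  double logSum = 0.0;
  for (double r : ratios)
    logSum += std::log(r);
  return std::exp(logSum / static_cast<double>(ratios.size()));
}

// Suite-level speedup: geomean of median(baseline) / median(patched),
// taken over ALL programs, not only the ones whose time changed.
double suiteSpeedup(const std::vector<std::vector<double>> &baselineRuns,
                    const std::vector<std::vector<double>> &patchedRuns) {
  std::vector<double> ratios;
  for (std::size_t p = 0; p < baselineRuns.size(); ++p)
    ratios.push_back(median(baselineRuns[p]) / median(patchedRuns[p]));
  return geomean(ratios);
}
```

With made-up numbers: if one program out of four becomes twice as fast and the other three are unchanged, the geomean over only the changed program is 2.0, while the geomean over the whole suite is the fourth root of 2, roughly 1.19. Averaging only the programs that moved therefore overstates the effect on the suite.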

Thanks for the detailed experimental results, Kristof.

Therefore, as is, I don't think the patch should be committed.
I'm wondering if there is any scope to make the AllowExpensiveTripCount heuristics smarter or more selective based on e.g. code characteristics?
Although the comments from James and Chad earlier indicate that that may need the compiler to guess very well how many iterations there are in inner loops, which is probably very hard without PGO information.

I agree with your opinion, so I will abandon this patch and upload a new patch soon.

Junmo.

flyingforyou abandoned this revision. Dec 15 2015, 4:43 PM

Revision Contents

Path: lib/Target/AArch64/AArch64TargetTransformInfo.cpp
Size: 4 lines

Diff 42497

lib/Target/AArch64/AArch64TargetTransformInfo.cpp

@@ void AArch64TTIImpl::getUnrollingPreferences(Loop *L,
   // For inner loop, it is more likely to be a hot one, and the runtime check
   // can be promoted out from LICM pass, so the overhead is less, let's try
   // a larger threshold to unroll more loops.
   if (L->getLoopDepth() > 1)
     UP.PartialThreshold *= 2;

   // Disable partial & runtime unrolling on -Os.
   UP.PartialOptSizeThreshold = 0;
+
+  // The benefits of unrolling often outweigh the cost of a division to compute
+  // the trip count.
+  UP.AllowExpensiveTripCount = true;
 }

 Value *AArch64TTIImpl::getOrCreateResultFromMemIntrinsic(IntrinsicInst *Inst,
                                                          Type *ExpectedType) {
   switch (Inst->getIntrinsicID()) {
   default:
     return nullptr;
   case Intrinsic::aarch64_neon_st2:
