
[AArch64] Force runtime unrolling for in-order scheduling models
Needs Review · Public

Authored by NickGuy on Mar 4 2021, 6:40 AM.


Event Timeline

NickGuy created this revision.Mar 4 2021, 6:40 AM
NickGuy requested review of this revision.Mar 4 2021, 6:40 AM
SjoerdMeijer added inline comments.Mar 4 2021, 6:50 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
981

Looks sensible to me that we do this (first) for smaller in-order cores.
From a quick look, quite a few other targets that implement this hook have:

// Scan the loop: don't unroll loops with calls as this could prevent
// inlining.

Do we need that too? Have you benchmarked this, and can we try this too?

Runtime unrolling is always going to be a bit of a heuristic, unfortunately. We don't usually know at compile time what the trip count will be, so we don't know if we will end up making things worse.

If the loop has already been vectorized (and possibly interleaved) it will already have been somewhat unrolled though. We may want to be more careful with cases like that and not runtime unroll a loop that has already been vectorized. We do the same thing for MVE, where it matters more due to the low register count and tail-predicated loops. But the same reasoning may well apply here too, where unrolling it further just makes the loop body too large to be useful.

NickGuy updated this revision to Diff 328978.Mar 8 2021, 5:11 AM

I've run some more benchmarks based on the suggestions:

Do we need that too? Have you benchmarked this, and can we try this too?

In the benchmark I ran with this change, I saw a slight regression (~0.01%), which, given the gains from unrolling, is negligible.

If the loop has already been vectorized (and possibly interleaved) it will already have been somewhat unrolled though. We may want to be more careful with cases like that and not runtime unroll a loop that has already been vectorized. We do the same thing for MVE, where it matters more due to the low register count and tail-predicated loops. But the same reasoning may well apply here too, where unrolling it further just makes the loop body too large to be useful.

With this change added onto the previous, I saw an improvement of ~0.4% over the unrestricted runtime unrolling.

This will still unroll the remainders of vectorized loops, which will be quite common but generally unhelpful to unroll. Can we try to prevent unrolling there too?

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1007

Should we be enabling partial unrolling too? If not, why not?

NickGuy updated this revision to Diff 333009.Wed, Mar 24, 8:53 AM

This will still unroll the remainders of vectorized loops, which will be quite common but generally unhelpful to unroll. Can we try to prevent unrolling there too?

It seems to be difficult to identify the remainder loop; checking the llvm.loop.isvectorized attribute (as in ARMTargetTransformInfo) negated all of the gains and caused a slight regression compared to not having this change at all. I've instead added a check for the llvm.loop.unroll.disable attribute, which was present on the remainder loop IR. This check makes no difference to the benchmark numbers. (Though thinking about it now, that might be because it's handled elsewhere, making the check redundant.)

Should we be enabling partial unrolling too? If not, why not?

Enabling partial unrolling made no difference in the benchmark results, so I haven't enabled it here.

It seems to be difficult to identify the remainder loop; checking the llvm.loop.isvectorized attribute (as in ARMTargetTransformInfo) negated all of the gains and caused a slight regression compared to not having this change at all. I've instead added a check for the llvm.loop.unroll.disable attribute, which was present on the remainder loop IR. This check makes no difference to the benchmark numbers. (Though thinking about it now, that might be because it's handled elsewhere, making the check redundant.)

Yeah, it doesn't seem worthwhile to check for something that the loop unroller will already be checking for. The vectorizer will unroll (/"interleave") at the same time as vectorizing, and we can already produce loops that are very wide compared to the input, sometimes to the point that they are never executed and we are only running the remainders. Unrolling on top of that will not be beneficial in a lot of cases, but that will depend heavily on the trip count of the loops.

The very quick benchmarks I ran didn't show this to be great because of all that extra unrolling, mostly of remainder loops by the look of it. I'm sure you have been running more benchmarks, and perhaps the ones here are not very representative of general A64 code.

Enabling partial unrolling made no difference in the benchmark results, so I haven't enabled it here.

Partial unrolling is for when the trip count is known at compile time but we unroll by a factor that doesn't evenly divide it, leaving a remainder. I would be very surprised if it didn't come up in any of your testing.