This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Enable partial and runtime unrolling
Closed, Public

Authored by samparker on Jun 26 2017, 7:33 AM.

Details

Summary

Enable runtime and partial loop unrolling of simple loops without calls, on Thumb2 M-Class cores. The thresholds are calculated based on the Subtarget's issue width.

Diff Detail

Repository
rL LLVM

Event Timeline

samparker created this revision. Jun 26 2017, 7:33 AM

Hi Sam,
Where are the tests? ;-) The ad hoc way of calculating the trip count (or the variable) in getTripCountValue() doesn't feel entirely right; isn't this what SCEV should be providing?
Cheers,
Sjoerd.

Forgot to say that it would be good if you could share some numbers showing why this is a good thing to do.

Bah, of course! I will get some tests together. As for the trip count... yes, I think SCEV handles these things, and I was hoping someone could point me to a nicer, more standardised way of getting the value.

cheers,
sam
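For reference, a minimal sketch of the SCEV route being hinted at, assuming the standard ScalarEvolution API (the wrapper helper itself is hypothetical, not code from this patch):

#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
using namespace llvm;

// Ask SCEV for the backedge-taken count instead of computing the trip
// count by hand; returns null when SCEV cannot analyse the exit condition.
static const SCEV *getTripCountSCEV(Loop *L, ScalarEvolution &SE) {
  const SCEV *BTC = SE.getBackedgeTakenCount(L);
  if (isa<SCEVCouldNotCompute>(BTC))
    return nullptr;
  return BTC; // the trip count is the backedge-taken count plus one
}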

samparker updated this revision to Diff 104139. Jun 27 2017, 4:52 AM

Added some tests and changed the initial thresholds to be based purely on the issue width of the CPU, for which I've added a helper function in ARMSubtarget.
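To illustrate the shape of that change, a hedged sketch; the helper name and the multiplier are placeholders, not the committed code:

// Hypothetical ARMSubtarget helper exposing the issue width from the
// MC scheduling model.
unsigned ARMSubtarget::getIssueWidth() const {
  return getSchedModel().IssueWidth;
}

// In ARMTTIImpl::getUnrollingPreferences, the thresholds could then be
// scaled from it.
UP.Partial = true;
UP.Runtime = true;
UP.Threshold = ST->getIssueWidth() * 4; // placeholder multiplier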

samparker edited the summary of this revision. Jun 27 2017, 5:27 AM

Hi Sjoerd,

I've been running an industry-standard benchmark suite targeting Cortex-M4 and Cortex-M7. This patch gives modest improvements on many of the benchmarks, as well as some significantly better results. Overall improvements are ~1% for Cortex-M4 and ~3% for Cortex-M7.

cheers,
sam

efriedma added inline comments.
lib/Target/ARM/ARMTargetTransformInfo.cpp
599 ↗(On Diff #104139)

These heuristics seem really weird... why does the loop depth affect the performance of a loop? Why does the trip count of the parent loop affect the performance of a loop?

I mean, I can imagine these heuristics are profitable for your particular benchmarks, but it seems like a performance trap: for example, someone turns on LTO, so a function gets inlined into a loop, so we decide it's no longer profitable to unroll the inner loop, and we lose a bunch of performance.

samparker added inline comments. Jun 28 2017, 1:40 AM
lib/Target/ARM/ARMTargetTransformInfo.cpp
599 ↗(On Diff #104139)

Hi Eli,

The depth check here is just for an early exit. For nested loops, I'm checking whether the trip count of the inner loop will be the same for each iteration of the outer loops; it's not the actual trip count that affects performance.

I have found that it's not profitable to unroll loops with calls because it can prevent inlining from happening, so unrolling is disabled in those cases in the hope that inlining occurs instead. Some inlining could push the cost above the unrolling threshold, but that doesn't matter because we still get the performance gains from inlining. The thresholds have been chosen to unroll small loops, where the overhead of the backedge on M-class cores is significant.

cheers,
sam
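Concretely, the invariance check Sam describes might look like this sketch, which uses the standard SCEV API (the helper itself is hypothetical):

// Only consider unrolling when SCEV proves the inner loop's trip count
// does not change across iterations of any enclosing loop.
static bool tripCountInvariantInParents(Loop *L, ScalarEvolution &SE) {
  const SCEV *BTC = SE.getBackedgeTakenCount(L);
  if (isa<SCEVCouldNotCompute>(BTC))
    return false;
  for (Loop *Parent = L->getParentLoop(); Parent;
       Parent = Parent->getParentLoop())
    if (!SE.isLoopInvariant(BTC, Parent))
      return false;
  return true;
}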

fhahn added a subscriber: fhahn. Jun 28 2017, 1:44 AM
samparker updated this revision to Diff 104378. Jun 28 2017, 3:37 AM

Amended some comments.

efriedma added inline comments. Jun 28 2017, 11:32 AM
lib/Target/ARM/ARMTargetTransformInfo.cpp
599 ↗(On Diff #104139)

The depth check here is just for an early exit. For nested loops, I'm checking whether the trip count of the inner loop will be the same for each iteration of the outer loops,

This is still problematic. You're deciding whether to unroll the loop based on code which is not part of the loop, which means the end result is going to be sensitive to unrelated inlining decisions.


Scanning the loop for calls seems fine.

test/CodeGen/ARM/loop-unrolling.ll
152 ↗(On Diff #104378)

Why do we want to avoid unrolling here? At first glance, this looks like it should be profitable to unroll (more scheduler freedom, avoid branches).

samparker updated this revision to Diff 104667. Jun 29 2017, 9:36 AM
samparker edited the summary of this revision.

Updated to use the freshly introduced ScalarEvolution parameter and simplified the heuristics by no longer checking the loop depth.

samparker added inline comments. Jun 29 2017, 10:51 AM
test/CodeGen/ARM/loop-unrolling.ll
152 ↗(On Diff #104378)

Enabling runtime unrolling can generate more spill code, and for small runtime trip counts the unrolled version may not even execute very often. As a result, I found that unrolling inner loops whose variable trip count is computed in the parent loop can cause some very significant regressions.

I should also say that even though the average performance improvement is modest, some tested benchmarks see improvements of up to 40%, 50% and 80% on Cortex-M4, Cortex-M7 and Cortex-M33, respectively.

I don't see how a trip count which varies for each iteration of the parent loop implies the trip count is small. I mean, it's possible it does for your particular benchmarks, but it seems unlikely to generalize to other code.

lib/Target/ARM/ARMTargetTransformInfo.cpp
585 ↗(On Diff #104667)

There isn't any reason to check isLoopInvariant for each loop: if the count is invariant relative to the topmost loop, it will be invariant relative to every loop it contains.
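In code, that observation collapses the per-parent walk sketched earlier to a single query (sketch only; BTC is the backedge-taken count from SCEV):

// Every contained loop executes within the outermost loop, so one
// isLoopInvariant query against it subsumes the per-loop checks.
Loop *Outermost = L;
while (Loop *Parent = Outermost->getParentLoop())
  Outermost = Parent;
bool TripCountInvariant = SE.isLoopInvariant(BTC, Outermost);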

Hi Eli,

Your comments make sense to me, so I ran an example to figure out if this heuristic was indeed nonsense. Here's the example kernel:

for (unsigned i = 0; i < max; ++i) {
  acc = 0;
  innerMax = dataSize - i; // inner trip count varies with the outer iteration
  for (unsigned j = 0; j < innerMax; ++j) {
    acc += (input[j] * input[i+j]) >> scaleValue;
  }
}

The results in the graph show that the unrolled version is often faster, but the net effect across the data set is that unrolling is detrimental to performance. My other benchmark results also show that this restriction doesn't negatively impact performance, so I think including the heuristic to prevent unrolling is valid.

cheers,
sam

Sam, interesting performance numbers! It looks like we see some strange bimodal behaviour: (big) improvements are cancelled out by (big) regressions. Assuming that codegen for this function is the same here (just the value of dataSize is different), I wonder whether we're just looking at some micro-architectural weirdness. So I don't think we can draw any conclusions from these numbers. Do we need more data points?

Yes, good point. Those numbers were from the M7, which has a branch predictor, so I will run it again on something simpler.

There are two kinds of SCEV expressions which are not invariant: SCEVAddRecExpr, and SCEVUnknown. If you have a SCEVAddRecExpr, you can show the loop nest is similar to your testcase, and I guess that might change the profitability of unrolling. If you have a SCEVUnknown, I can't see how you would conclude anything useful; there's a strong possibility the trip count is actually constant, but the compiler can't prove it because of aliasing or something like that.
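As a sketch, that distinction can be made explicit before drawing any conclusion; this fragment only inspects the top-level expression (the expression classes live in ScalarEvolutionExpressions.h):

const SCEV *BTC = SE.getBackedgeTakenCount(L);
if (isa<SCEVAddRecExpr>(BTC)) {
  // The count evolves with an enclosing loop, as in the kernel above;
  // this is the case where unrolling profitability may genuinely change.
} else if (isa<SCEVUnknown>(BTC)) {
  // Opaque to SCEV: the value may well be constant at runtime, so nothing
  // useful follows about how it varies.
}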

samparker updated this revision to Diff 105842. Jul 10 2017, 5:38 AM

After more testing, I've found that the variant inner-loop check was unnecessary, so I've removed it.

efriedma added inline comments. Jul 10 2017, 3:56 PM
lib/Target/ARM/ARMTargetTransformInfo.cpp
552 ↗(On Diff #105842)

If you want this to be a no-op for other architectures, shouldn't this be "return BasicTTIImplBase::getUnrollingPreferences(L, SE, UP)"?
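The suggested fallback, in context; a sketch in which BaseT is the usual CRTP alias for BasicTTIImplBase<ARMTTIImpl> and the guard condition is illustrative:

void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
                                         TTI::UnrollingPreferences &UP) {
  // Defer entirely to the generic implementation on cores this patch
  // doesn't target, so other architectures see no behavioural change.
  if (!ST->isThumb2() || !ST->isMClass())
    return BaseT::getUnrollingPreferences(L, SE, UP);
  // ... M-class-specific thresholds follow ...
}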

559 ↗(On Diff #105842)

Do we really call into getUnrollingPreferences for the early unroll pass? We probably don't want to be using target-specific unroll heuristics for "createSimpleLoopUnrollPass". But I guess that's not something you need to fix in this patch.

lib/Target/ARM/ARMTargetTransformInfo.h
126 ↗(On Diff #105842)

"override"?

samparker added inline comments. Jul 11 2017, 9:30 AM
lib/Target/ARM/ARMTargetTransformInfo.cpp
552 ↗(On Diff #105842)

ah yes, thanks.

559 ↗(On Diff #105842)

As far as I can see, the unroll pass functionality can only be controlled externally via command-line options or by target hooks.
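For example, the unroller exposes cl::opts in LoopUnrollPass.cpp that override its decisions from the command line; the file names here are illustrative:

opt -loop-unroll -unroll-runtime -unroll-threshold=300 -S input.ll -o unrolled.ll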

lib/Target/ARM/ARMTargetTransformInfo.h
126 ↗(On Diff #105842)

Oddly, no, it's not virtual.

samparker updated this revision to Diff 106052. Jul 11 2017, 9:34 AM

Added the default call to the base implementation.

efriedma added inline comments. Jul 11 2017, 10:50 AM
test/CodeGen/ARM/loop-unrolling.ll
1 ↗(On Diff #106052)

We usually put tests for IR transforms into test/Transforms/; for unrolling in particular, that would be something like test/Transforms/LoopUnroll/ARM/ (you'll have to create a new directory, and make a lit.local.cfg so the test is only enabled if we have an ARM target).
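For reference, such a lit.local.cfg conventionally contains the standard target-guard pattern used by other ARM test directories:

if not 'ARM' in config.root.targets:
    config.unsupported = True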

And I'd like to see some basic test to show that this doesn't change the unroll heuristics on, for example, cortex-a57 (where BasicTTIImplBase::getUnrollingPreferences currently changes the unroll threshold).

samparker updated this revision to Diff 106443. Jul 13 2017, 9:46 AM

I've moved the test to the suggested directory and added some more targets to it. I've also now enabled unrolling for Thumb targets and removed the use of the issue width in the calculation.

samparker updated this revision to Diff 106445. Jul 13 2017, 9:50 AM

Now added the test config file!

efriedma added inline comments. Jul 17 2017, 5:39 PM
test/Transforms/LoopUnroll/ARM/loop-unrolling.ll
5 ↗(On Diff #106445)

Could you add a run to check -mcpu=cortex-a57, to verify that we're still inheriting the unroll heuristics from BasicTTIImplBase?
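Something along these lines would do it; the triple and check prefix are illustrative:

; RUN: opt -mtriple=armv8-linux-gnueabihf -mcpu=cortex-a57 -loop-unroll -S %s | FileCheck %s --check-prefix=CHECK-A57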

samparker updated this revision to Diff 107305. Jul 19 2017, 8:10 AM

Hi Eli,

I've updated the tests to target the A57 for both A32 and T32 mode.

I've actually now benchmarked the A53 and A57 with this patch and the results are favourable, but there is a bug I need to fix before I enable it. That will be another patch.

cheers,
sam

This revision is now accepted and ready to land. Jul 19 2017, 12:26 PM
This revision was automatically updated to reflect the committed changes.