This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Improve loop unrolling for Cortex-M
ClosedPublic

Authored by samparker on Aug 1 2017, 3:28 AM.

Details

Summary

Override the default unroll count to 4 and use the UnrollRemainder option. Disable unrolling on Thumb1 only targets.

Diff Detail

Event Timeline

samparker created this revision.Aug 1 2017, 3:28 AM
samparker updated this revision to Diff 109098.Aug 1 2017, 6:29 AM

Slightly lowered the threshold for forced unrolling.

Diff is missing context.

lib/Target/ARM/ARMTargetTransformInfo.cpp
596

getUserCost()?

623

Have you experimented with setting "UP.Runtime = false"?

samparker added inline comments.Aug 2 2017, 12:26 AM
lib/Target/ARM/ARMTargetTransformInfo.cpp
596

This has crossed my mind, I'll look into it.

623

Yes, but there's a lot of performance to be gained by enabling runtime. This patch is trying to get some of that improvement, while minimising the negative affect it can have.

samparker updated this revision to Diff 109341.Aug 2 2017, 7:08 AM

Hi Eli,

Now using the TTI cost model for instruction counting. I've also now removed support for Thumb-1 targets as I am starting to loose the will trying to characterise their behaviour!

cheers,
sam

efriedma added inline comments.Aug 2 2017, 11:17 AM
lib/Target/ARM/ARMTargetTransformInfo.cpp
621

This seems like you're running into some sort of limitation of unrolling infrastructure. Maybe we need to add a feature to unroll remainder loops?

Also, which function in the test covers this codepath?

samparker added inline comments.Aug 3 2017, 3:00 AM
lib/Target/ARM/ARMTargetTransformInfo.cpp
621

The integer type check isn't actually tested and wasn't something that I was interested in, so I will remove it.

I'm not sure I understand what you mean. For clarity and posterity, the type check on the SCEV is not querying the number of iterations but whether the expression is based on int, float, etc... values. Currently, there is a TODO in the unroller to handle counts with pointer types. The runtime unroller creates a unrolled body and just uses an if-else statement to execute the correct loop, but the original loop is also called after the unrolled loop for the remaining iterations (N % unroll_count). The runtime unroller, by default, will only unroll the loops for which SCEV can produce a trip count because it can guarantee than the basic block can be duplicated and merged. Otherwise, the body can be duplicated but the basic blocks cannot be merged. The iterate_inc function is what tests this and hopefully highlights the problem that the loop count is dependent on the length of the linked list and SCEV cannot be expected to be able express this.

samparker updated this revision to Diff 109507.Aug 3 2017, 3:05 AM

Hi Eli,

I've removed the integer type check of the trip expression. Please see my inline reply, sorry if I'm stating the obvious but I just wanted to try and avoid any unnecessary misunderstanding with delayed back-and-forth.

Many thanks,
sam

Okay, I can try to expand on the "limitation of unrolling infrastructure" bit. The question is, given that I have a small loop, and have an expression for the trip count of the loop, what's the optimal way to generate code for the loop?

When the iteration count is generally large, and the CPU doesn't have special hardware for looping quickly, we want to unroll a bunch of times to reduce the iteration overhead as much as possible, then generate a tiny remainder loop (where the performance doesn't really matter).

Okay, but what happens when the iteration count is small, but the loop runs many times? The runtime-unrolled version of the loop is completely useless, and we end up spending most of our time in the remainder loop. So what can we do about that? One option is to fully unroll the remainder loop: we know the maximum trip count, so it's a straightforward unroll operation. Or maybe we could runtime-unroll the remainder loop (and generate a remainder loop for the remainder loop). Or maybe we could try to do something fancy with a switch. I'm not sure what option is best without actually testing it. But there are definitely options here, and we can probably do better than just setting "DefaultUnrollRuntimeCount = 2" to dodge the issue.

Thanks for the clarification and the ideas. This sounded like loop peeling to me so it's what I've used, could you take a look at D36309? I know it doesn't have a test case, I would like your feedback before I go too far. It solves the issues that I have observed though.

thanks,
sam

samparker updated this revision to Diff 111148.Aug 15 2017, 5:21 AM
samparker edited the summary of this revision. (Show Details)

Hi Eli,

I've updated the patch to use the UnrollRemainder option and I've set the default count to 4 across the cores.

thanks,
sam

efriedma accepted this revision.Aug 15 2017, 12:08 PM
efriedma added subscribers: zzheng, qcolombet.

LGTM.

This revision is now accepted and ready to land.Aug 15 2017, 12:08 PM
This revision was automatically updated to reflect the committed changes.