Override the default unroll count to 4 and use the UnrollRemainder option. Disable unrolling on Thumb1-only targets.
Hi Eli,
Now using the TTI cost model for instruction counting. I've also now removed support for Thumb-1 targets as I am starting to lose the will trying to characterise their behaviour!
cheers,
sam
lib/Target/ARM/ARMTargetTransformInfo.cpp | ||
---|---|---|
619 | This seems like you're running into some sort of limitation of unrolling infrastructure. Maybe we need to add a feature to unroll remainder loops? Also, which function in the test covers this codepath? |
lib/Target/ARM/ARMTargetTransformInfo.cpp | ||
---|---|---|
619 | The integer type check isn't actually tested and wasn't something that I was interested in, so I will remove it. I'm not sure I understand what you mean. For clarity and posterity, the type check on the SCEV is not querying the number of iterations but whether the expression is based on int, float, etc. values. Currently, there is a TODO in the unroller to handle counts with pointer types. The runtime unroller creates an unrolled body and just uses an if-else statement to execute the correct loop, but the original loop is also called after the unrolled loop for the remaining iterations (N % unroll_count). The runtime unroller, by default, will only unroll the loops for which SCEV can produce a trip count because it can guarantee that the basic block can be duplicated and merged. Otherwise, the body can be duplicated but the basic blocks cannot be merged. The iterate_inc function is what tests this and hopefully highlights the problem that the loop count is dependent on the length of the linked list and SCEV cannot be expected to be able to express this. |
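
A source-level sketch may help picture the two cases described in the reply above. It is only an illustration written in C++, not the IR that LLVM actually emits. `inc_unrolled` and the `Node` type are hypothetical, and the `iterate_inc` shown here is a guess at the shape of the test function (the test itself is not reproduced in this thread). The unroll factor of 4 is chosen to match the patch summary.

```cpp
#include <cstddef>

// Counted loop: the trip count n is known on entry, so the body can be
// duplicated 4 times and the copies merged into one block.
void inc_unrolled(int *a, std::size_t n) {
  std::size_t i = 0;
  // Unrolled body: runs while at least 4 iterations remain.
  for (; i + 4 <= n; i += 4) {
    a[i] += 1;
    a[i + 1] += 1;
    a[i + 2] += 1;
    a[i + 3] += 1;
  }
  // Remainder: the original loop handles the last n % 4 iterations.
  for (; i < n; ++i)
    a[i] += 1;
}

// Hypothetical linked-list node, assumed for illustration only.
struct Node {
  int val;
  Node *next;
};

// Linked-list loop: the trip count is the length of the list, which cannot
// be written as an expression of values available before the loop starts.
// Each duplicated body would still need its own exit check.
void iterate_inc(Node *n) {
  while (n) {
    n->val += 1;
    n = n->next;
  }
}
```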
Hi Eli,
I've removed the integer type check of the trip expression. Please see my inline reply, sorry if I'm stating the obvious but I just wanted to try and avoid any unnecessary misunderstanding with delayed back-and-forth.
Many thanks,
sam
Okay, I can try to expand on the "limitation of unrolling infrastructure" bit. The question is, given that I have a small loop, and have an expression for the trip count of the loop, what's the optimal way to generate code for the loop?
When the iteration count is generally large, and the CPU doesn't have special hardware for looping quickly, we want to unroll a bunch of times to reduce the iteration overhead as much as possible, then generate a tiny remainder loop (where the performance doesn't really matter).
Okay, but what happens when the iteration count is small, but the loop runs many times? The runtime-unrolled version of the loop is completely useless, and we end up spending most of our time in the remainder loop. So what can we do about that? One option is to fully unroll the remainder loop: we know the maximum trip count, so it's a straightforward unroll operation. Or maybe we could runtime-unroll the remainder loop (and generate a remainder loop for the remainder loop). Or maybe we could try to do something fancy with a switch. I'm not sure what option is best without actually testing it. But there are definitely options here, and we can probably do better than just setting "DefaultUnrollRuntimeCount = 2" to dodge the issue.
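
As a rough illustration of the first and third options, the sketch below unrolls by 4 and then replaces the remainder loop with straight-line code selected by a switch. Since at most 3 iterations can remain, the remainder has a known maximum trip count. The function name `inc_unroll_remainder` is hypothetical, and this is a source-level picture under those assumptions, not what the unroller generates.

```cpp
#include <cstddef>

void inc_unroll_remainder(int *a, std::size_t n) {
  std::size_t i = 0;
  // Main unrolled loop, 4 iterations per trip.
  for (; i + 4 <= n; i += 4) {
    a[i] += 1;
    a[i + 1] += 1;
    a[i + 2] += 1;
    a[i + 3] += 1;
  }
  // Fully unrolled remainder: n - i is 0, 1, 2 or 3 here, so no loop
  // back-edge is needed, only a single dispatch on the count.
  switch (n - i) {
  case 3:
    a[i + 2] += 1;
    // fall through
  case 2:
    a[i + 1] += 1;
    // fall through
  case 1:
    a[i] += 1;
    // fall through
  default:
    break;
  }
}
```

When n is small, execution skips the main loop and goes straight to the switch, so the cost per call is one comparison plus a dispatch rather than up to three trips around a remainder loop.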
Thanks for the clarification and the ideas. This sounded like loop peeling to me, so that's what I've used. Could you take a look at D36309? I know it doesn't have a test case, but I would like your feedback before I go too far. It solves the issues that I have observed, though.
thanks,
sam
Hi Eli,
I've updated the patch to use the UnrollRemainder option and I've set the default count to 4 across the cores.
thanks,
sam
getUserCost()?