Enable runtime and partial loop unrolling of simple loops without calls, on Thumb2 M-Class cores. The thresholds are calculated based on the Subtarget's issue width.
Details
Diff Detail
Event Timeline
Hi Sam,
Where are the tests? ;-) The ad hoc way of calculating the trip count (or the variable) in getTripCountValue() doesn't feel entirely right, is this not what SCEV should be providing?
Cheers,
Sjoerd.
Forgot to say that it would be good if you can share some numbers why this a good thing to do.
Bah, of course! I will get some tests together. As for the trip count.... yes, I think SCEV handles these things and I was hoping someone could point me to a nicer and more standardised way of getting the value.
cheers,
sam
Added some tests and changed the initial thresholds to be based purely on the issue width of the cpu, which I've added a helper function for in ARMSubtarget.
Hi Sjoerd,
I've been running an industry standard benchmark suite and targeting Cortex-M4 and Cortex-M7. This patch gives modest improvements to many of the benchmarks, as well as some significantly better results. Overall improvements are ~1% for Cortex-M4 and ~3% for Cortex-M7.
cheers,
sam
lib/Target/ARM/ARMTargetTransformInfo.cpp | ||
---|---|---|
599 | These heuristics seem really weird... why does the loop depth affect the performance of a loop? Why does the trip count of the parent loop affect the performance of a loop? I mean, I can imagine these heuristics are profitable for your particular benchmarks, but it seems like a performance trap: for example, someone turns on LTO, so a function gets inlined into a loop, so we decide it's no longer profitable to unroll the inner loop, and we lose a bunch of performance. |
lib/Target/ARM/ARMTargetTransformInfo.cpp | ||
---|---|---|
599 | Hi Eli, The depth check here is just for an early exit. For nested loops, I'm checking whether the trip count of the inner loop will be the same for each iteration of the outer loops, it's not the actual trip count that affects performance. I have found that its not profitable to unroll loops with calls because it can prevent the inlining from happening, so unrolling will not happen in those cases and I hope that the inlining does occur. Some inlining could push the cost above the unrolling threshold, but that doesn't matter because we're still getting the inlining performance gains. The thresholds have been chosen to unroll small loops where the overhead of the backedge, on m-class cores, is significant. cheers, |
lib/Target/ARM/ARMTargetTransformInfo.cpp | ||
---|---|---|
599 |
This is still problematic. You're basing the question of whether to unroll the loop based on code which is not part of the loop, which means the end result is going to be sensitive to unrelated inlining decisions. Scanning the loop for calls seems fine. | |
test/CodeGen/ARM/loop-unrolling.ll | ||
153 | Why do we want to avoid unrolling here? At first glance, this looks like it should be profitable to unroll (more scheduler freedom, avoid branches). |
Updated to use the freshly introduced scalar evolution and also simplified the heuristics by not checking the loop depth.
test/CodeGen/ARM/loop-unrolling.ll | ||
---|---|---|
153 | Enabling runtime unrolling can generate more spill code, and for small runtime counts the unrolled version may not even be executed that frequently, so I found that unrolling inner loops whose variable trip count is computed in the parent loop can cause some very significant regressions. |
I should also say that even though the average performance improvement is modest, some tested benchmarks see an improvement of up to 40, 50 and 80% for Cortex-M4, M7 and M33.
I don't see how a trip count which varies for each iteration of the parent loop implies the trip count is small. I mean, it's possible it does for your particular benchmarks, but it seems unlikely to generalize to other code.
lib/Target/ARM/ARMTargetTransformInfo.cpp | ||
---|---|---|
585 | There isn't any reason to check isLoopInvariant for each loop: if the count is invariant relative to the topmost loop, it will be invariant relative to every loop it contains. |
Hi Eli,
Your comments make sense to me, so I ran an example to figure out if this heuristic was indeed nonsense. Here's the example kernel:
for (unsigned i = 0; i < max; ++i) { acc = 0; innerMax = dataSize - i; for (unsigned j = 0; j < innerMax; ++j) { acc += (input[j] * input[i+j]) >> scaleValue; }
The results in the graph show that often the unrolled version is faster, but the net affect across the data set is that unrolling is detrimental on performance. My other benchmark results also show that having this restriction doesn't negatively impact the performance, so I think that including the heuristic to prevent unrolling is valid.
cheers,
sam
Sam, interesting performance numbers! It looks like we see some strange bimodal behaviour: (big) improvements are canceled out by (big) regressions. Assuming that codegen for this function is the same here (just the value of dataSize is different), I am wondering if we are not just looking at some micro-architecture weirdness? So I don't think we can draw any conclusions from these numbers. Do we need more data points?
Yes good point. Those numbers were running on the M7, which has a branch predicator... so I will run it again on something more simple.
There are two kinds of SCEV expressions which are not invariant: SCEVAddRecExpr, and SCEVUnknown. If you have a SCEVAddRecExpr, you can show the loop nest is similar to your testcase, and I guess that might change the profitability of unrolling. If you have a SCEVUnknown, I can't see how you would conclude anything useful; there's a strong possibility the trip count is actually constant, but the compiler can't prove it because of aliasing or something like that.
After more testing, I've found that the variant inner loop check was unnecessary so I've removed that check.
lib/Target/ARM/ARMTargetTransformInfo.cpp | ||
---|---|---|
552 | If you want this to be a no-op for other architectures, shouldn't this be "return BasicTTIImplBase::getUnrollingPreferences(L, SE, UP)"? | |
559 | Do we really call into getUnrollingPreferences for the early unroll pass? We probably don't want to be using target-specific unroll heuristics for "createSimpleLoopUnrollPass". But I guess that's not something you need to fix in this patch. | |
lib/Target/ARM/ARMTargetTransformInfo.h | ||
126 | "override"? |
test/CodeGen/ARM/loop-unrolling.ll | ||
---|---|---|
1 | We usually put tests for IR transforms into test/Transforms/; for unrolling in particular, that would be something like test/Transforms/LoopUnroll/ARM/ (you'll have to create a new directory, and make a lit.local.cfg so the test is only enabled if we have an ARM target). And I'd like to see some basic test to show that this doesn't change the unroll heuristics on, for example, cortex-a57 (where BasicTTIImplBase::getUnrollingPreferences currently changes the unroll threshold). |
I've moved the test to the suggested directory and added some more targets to it. I've also now enabled the unrolling for Thumb targets and removed the use of the issue width in the calculation.
test/Transforms/LoopUnroll/ARM/loop-unrolling.ll | ||
---|---|---|
5 ↗ | (On Diff #106445) | Could you add a run to check -mcpu=cortex-a57, to verify this we're still inheriting unroller heuristics from BasicTTIImplBase? |
Hi Eli,
I've updated the tests to target the A57 for both A32 and T32 mode.
I've actually now benchmarked the A53 and A57 with this patch and the results are favourable, but there is a bug I need to fix first before I enable it. That will be another patch.
cheers,
sam
"override"?