As it stands, the loop is unrolled only if the profile-based trip count is not greater than a limit, which is 7 by default; i.e. no iterations are peeled off a loop with an estimated trip count of 8. This patch is based on the idea that a slightly longer loop can still benefit from having its first several iterations extracted. The longer the loop, the smaller the effect of the extraction, so if the trip count is estimated to be more than twice the threshold, the optimization is not applied.
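To make the proposed heuristic concrete, here is a minimal sketch of the decision it describes. The names (choosePeelCount, estimatedTripCount, maxPeelCount) are illustrative and do not correspond exactly to the code in LLVM's loop-peeling implementation.

```cpp
#include <optional>

// Illustrative sketch only; the real logic lives in LLVM's loop-peeling
// code and uses different names and additional constraints.
std::optional<unsigned> choosePeelCount(unsigned estimatedTripCount,
                                        unsigned maxPeelCount /* 7 by default */) {
  // Existing behaviour: peel every iteration if the whole loop fits
  // under the limit.
  if (estimatedTripCount <= maxPeelCount)
    return estimatedTripCount;
  // Proposed behaviour: for loops estimated at up to twice the limit,
  // still peel the first maxPeelCount iterations.
  if (estimatedTripCount <= 2 * maxPeelCount)
    return maxPeelCount;
  // Longer loops gain too little from peeling; leave them alone.
  return std::nullopt;
}
```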
A bit of context: after D71990, we noticed that one of our tests became slower and did not benefit from PGO. The test has a loop that is executed exactly 8 times; before that patch it was (partially) unrolled, but after it the loop is kept in its compact form.
For loops with a trip count between maxPeelCount and 2*maxPeelCount, this patch enables partial peeling of maxPeelCount iterations. This feels very aggressive and can be detrimental to performance: 1) increased icache footprint; 2) more pressure on the BPU; and 3) the back branch of the remaining loop may become less predictable.
What is the motivation case for such a change?
I've described the motivation in my first comment. I believe these concerns apply to any peeling strategy based on the trip count. If there is a consensus that a loop with a trip count of 7 can be optimized, why can't a loop with a trip count of 8 be optimized in the same way? The patch provides a smooth transition between what is optimized and what is not, which should work better than a hard cutoff.
Have you analyzed the root cause of the performance regression (with performance counters)? Is it due to other missed optimizations without the peeling or is it due to side effects such as loop alignment?
Changes like this one are very likely to cause regressions in other workloads, so it might be better to solve the problem with tuning options (which are there for a reason). Longer term, I'd like to see a more sophisticated cost/benefit analysis implemented, which will also be target dependent.
Our test was for AArch64, so there should be no effects like loop alignment, if I understand you right. I couldn't find a benchmark to test the change against, so I prepared an artificial one based on the LLVM test suite's SingleSource/Benchmarks/Adobe-C++/loop-unroll.cpp. I saw a performance gain of 10%-30% depending on the iteration count (on x86_64). But I understand that test is probably biased, so if you can recommend a commonly recognized public benchmark that might be suitable for the case, I will try it. As for a deeper analysis, could you share some examples so I can borrow some ideas from them?
Alignment-related uArch quirks are much less likely to happen, but still possible.
You want to try programs with a large instruction working set that are built with PGO. Lacking those, at least SPEC06 or SPEC17.
As regards the analysis, you can try perf events like cycles, retired instructions, retired taken branches, branch mispredictions, etc., to help you understand the improvement better.
Unfortunately, I do not have access to the SPEC tests. I checked with 7zip-benchmark and found that Compressing speed is slightly improved with this patch while Decompressing is slightly degraded. But I have to note that, compared to a Release build, the PGO one is worse even without the patch.
As for my own tests, I noticed that the performance gain with unrolled loops comes from generated code with fewer instructions. In one sample there was no need to reserve a register for the loop counter: since the value of the counter was known in each iteration, there were no instructions to increment it, and the comparison instructions used an immediate value instead of comparing against a register, which made the code path shorter. In another example, subsequent iterations used overlapping data from memory. Both the compact and the unrolled versions loaded the data into a set of registers, but the compact version had to shift the values between registers in each iteration so that they always ended up in the same predefined registers, while the unrolled version simply used instructions that work with a different set of registers in each iteration. So, again, unrolling helped to generate code with fewer instructions for each peeled iteration.
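To illustrate the first mechanism (a known counter value in each peeled iteration), here is a rough, hand-written example of the kind of code peeling produces; the function names are made up and the actual codegen of course differs.

```cpp
// A loop whose bound is only known at run time, but which the profile
// says usually runs about 8 times.
int sumFirstN(const int *data, int n) {
  int acc = 0;
  for (int i = 0; i < n; ++i)
    acc += data[i];
  return acc;
}

// Roughly what peeling the first three iterations produces: `i` is a
// compile-time constant in each peeled copy, so there is no counter
// register and no increment, and each exit check compares `n` against
// an immediate; the residual loop handles the remaining iterations.
int sumFirstNPeeled(const int *data, int n) {
  int acc = 0;
  if (n <= 0) return acc;
  acc += data[0];
  if (n <= 1) return acc;
  acc += data[1];
  if (n <= 2) return acc;
  acc += data[2];
  for (int i = 3; i < n; ++i)
    acc += data[i];
  return acc;
}
```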
Thanks for the analysis. That matches what I expected -- which circles back to my original question -- how do we know the new limit is generally better than the old limit? It really depends on the cost/benefit analysis. There are workloads (as demonstrated by the Decompressing case) that get hurt.
On the other hand, the patch itself (the restructuring part) looks good -- it makes the code cleaner. I suggest you commit the restructuring part first (e.g. by removing the *2 multiplier part). I think it is better to keep the default setting unchanged until a better analysis is in place.
- Extract an NFCI part into D124388
I am not sure I understand why this patch requires any additional analysis compared to what is already implemented. The existing heuristic states that extracting up to 7 iterations for loops with estimated trip counts up to 7 is beneficial. The new heuristic is that peeling the same 7 iterations for loops with estimated trip counts slightly greater than 7 is still beneficial. They are the same 7 iterations, with exactly the same profit. If such an analysis is required, it should be applied to the whole idea of peeling loops based on estimated trip counts.