For the code below, the loop vectorizer is expected to unroll the loop and break loop-carried dependencies by using more registers (see http://reviews.llvm.org/D7128).
for (int i = 0; i < n; i++) {
  for (int i_c = 0; i_c < 3; i_c++) {
    _Complex __attribute__ ((aligned (8))) at = a[i][i_c];
    sum += ((__real__(at))*(__real__(at)) + (__imag__(at))*(__imag__(at)));
  }
}
Here, the inner loop is first unrolled by the regular loop unroller, which doesn't break dependencies. The loop vectorizer should then unroll the outer loop and break dependencies. But this doesn't happen, because the heuristics consider only small loops worth unrolling. This patch fixes the heuristics.
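To make "breaking dependencies by using more registers" concrete, here is a hand-written sketch of what interleaving the outer loop by 2 could look like for this reduction. It is only an illustration, not the vectorizer's actual output: the function name, the element type (double _Complex), the interleave factor of 2 and the remainder loop are all assumptions. Note that splitting the sum reassociates the floating-point additions, which a compiler would only do when that is allowed (e.g. under fast-math).

```c
#include <assert.h>

/* Hypothetical helper (not from the patch): sums |a[i][c]|^2 over an
 * n x 3 array, with the outer loop interleaved by 2 by hand. */
double sum_norms_interleaved(double _Complex (*a)[3], int n)
{
    assert(n >= 0);

    /* Two independent accumulators: the additions into sum0 do not
     * depend on the additions into sum1, so the two chains can overlap. */
    double sum0 = 0.0, sum1 = 0.0;
    int i = 0;

    /* Main loop: two outer iterations (rows i and i+1) per trip. */
    for (; i + 1 < n; i += 2) {
        for (int c = 0; c < 3; c++) {
            double _Complex x = a[i][c];
            double _Complex y = a[i + 1][c];
            sum0 += __real__(x) * __real__(x) + __imag__(x) * __imag__(x);
            sum1 += __real__(y) * __real__(y) + __imag__(y) * __imag__(y);
        }
    }

    /* Remainder: at most one leftover row when n is odd. */
    for (; i < n; i++) {
        for (int c = 0; c < 3; c++) {
            double _Complex x = a[i][c];
            sum0 += __real__(x) * __real__(x) + __imag__(x) * __imag__(x);
        }
    }

    /* Combine the partial sums once, after the loop. */
    return sum0 + sum1;
}
```

With a single accumulator, every addition has to wait for the previous one; with two, the critical path through the reduction is roughly halved at the cost of one extra live register.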
For the example above on POWER8, there is a 2.5x speedup. To handle all targets appropriately, I propose the following cost function to compute the unroll factor (not in this patch yet):
UF = UF * CriticalPathLength / LoopLength
UF is the unroll factor already computed up to this point, which takes into account register pressure and is bounded by the TTI max interleave factor. CriticalPathLength and LoopLength only take into account interesting operations (FP?). The cost function reduces UF, approximately, while ensuring that ILP opportunities are met. It works well for the example above on P8. Please tell me what you think.
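As a rough sketch of the proposed formula in isolation (this is not code from the patch): the function name, the use of truncating integer division, and the clamp to a minimum factor of 1 are assumptions made for illustration. How CriticalPathLength and LoopLength are counted is left to the caller, since the proposal does not pin that down yet.

```c
#include <assert.h>

/* Hypothetical helper: scales an already-computed unroll factor by
 * CriticalPathLength / LoopLength, as in the proposed cost function.
 * Both lengths are counts of "interesting" operations, however the
 * caller chooses to define them. */
unsigned scale_unroll_factor(unsigned uf,
                             unsigned critical_path_len,
                             unsigned loop_len)
{
    assert(loop_len > 0);

    /* Multiply before dividing so that integer truncation happens once. */
    unsigned scaled = uf * critical_path_len / loop_len;

    /* Assumption: never go below 1, i.e. "no unrolling". */
    return scaled < 1 ? 1 : scaled;
}
```

For instance, with UF = 8, a critical path of 3 operations and a loop of 6 operations, this gives 8 * 3 / 6 = 4. When the whole loop body is on the critical path, UF is left unchanged.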
Thanks,
Olivier
I don't like this description. I recommend just saying: