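Break the dependencies between unrolled iterations of loop reductions by giving each unrolled copy its own accumulator. This should be particularly effective on superscalar targets, where the independent additions can issue in parallel. For a kernel similar to the one below, we get a 2.5x speedup on POWER8 when the unroll factor is 3 (the example below shows a factor of 2).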
// Original reduction.
for (int i = 0; i < n; ++i)
  r += arr[i];
// Unrolled reduction (n assumed even). Both adds still form one serial chain through r.
for (int i = 0; i < n; i += 2) {
  r += arr[i];
  r += arr[i+1];
}
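// Optimized reduction (n assumed even). r and r_0 are independent accumulators.
float r_0 = 0;
for (int i = 0; i < n; i += 2) {
  r += arr[i];
  r_0 += arr[i+1];
}
// Combine the partial sums once, after the loop.
r += r_0;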
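
For reference, the same idea at unroll factor 3 can be sketched by hand as below. This is an illustration only, not code from the patch: the function names reduce_serial and reduce_split3 are hypothetical, and the remainder loop is an assumption added so that n need not be a multiple of 3. Because floating-point addition is not associative, the split version can differ from the serial one in the low bits, so a compiler can generally apply this only when reassociation is permitted.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: a single accumulator, so every add waits on the previous one. */
float reduce_serial(const float *arr, size_t n) {
    assert(arr != NULL || n == 0);
    float r = 0.0f;
    for (size_t i = 0; i < n; ++i)
        r += arr[i];
    return r;
}

/* Hypothetical hand-written version with three independent accumulators
 * (unroll factor 3). The three adds in the loop body do not depend on
 * each other, only on their own accumulator from the previous trip. */
float reduce_split3(const float *arr, size_t n) {
    assert(arr != NULL || n == 0);
    float r = 0.0f, r_0 = 0.0f, r_1 = 0.0f;
    size_t i = 0;
    for (; i + 3 <= n; i += 3) {
        r   += arr[i];
        r_0 += arr[i + 1];
        r_1 += arr[i + 2];
    }
    /* Remainder: at most two leftover elements when n is not a multiple of 3. */
    for (; i < n; ++i)
        r += arr[i];
    /* Combine the partial sums once, after the loop. */
    return r + r_0 + r_1;
}
```

With small integer-valued inputs both functions return the same value; with general floating-point data the results may differ slightly because the additions are performed in a different order.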