Break dependencies between unrolled iterations of reductions in loops. This should be particularly effective for superscalar targets. For a kernel similar to the one below, we get a 2.5x speedup on POWER8 when the unroll factor is 3.

  // Original reduction.
  for (int i = 0; i < n; ++i)
    r += arr[i];

  // Unrolled reduction.
  for (int i = 0; i < n; i += 2) {
    r += arr[i];
    r += arr[i+1];
  }

  // Optimized reduction.
  float r_0 = 0;
  for (int i = 0; i < n; i += 2) {
    r += arr[i];
    r_0 += arr[i+1];
  }
  r += r_0;
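
For reference, here is a minimal hand-written sketch in C of the same idea at unroll factor 3. It is not the transformation's actual output: the function names sum_serial and sum_split3, the extra accumulator r_1, and the remainder loop are assumptions added for illustration. Each accumulator forms its own dependency chain, so the three adds in one iteration do not have to wait on each other. Note that splitting a floating-point sum reassociates the additions, so results can differ in the last bits from the serial loop; a compiler generally needs relaxed floating-point semantics to do this automatically.

```c
#include <stddef.h>

/* Baseline: one accumulator, so every add depends on the previous add. */
float sum_serial(const float *arr, size_t n) {
    float r = 0.0f;
    for (size_t i = 0; i < n; ++i)
        r += arr[i];
    return r;
}

/* Unrolled by 3 with one accumulator per unrolled position.
 * r, r_0 and r_1 form three independent dependency chains. */
float sum_split3(const float *arr, size_t n) {
    float r = 0.0f, r_0 = 0.0f, r_1 = 0.0f;
    size_t i = 0;

    /* Main loop: handles full groups of three elements. */
    for (; i + 3 <= n; i += 3) {
        r   += arr[i];
        r_0 += arr[i + 1];
        r_1 += arr[i + 2];
    }

    /* Remainder loop: handles the last n % 3 elements. */
    for (; i < n; ++i)
        r += arr[i];

    /* Combine the partial sums once, after the loop. */
    return r + r_0 + r_1;
}
```

The remainder loop is there because the trip count need not be a multiple of the unroll factor; the snippets above omit it for brevity.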