This is an archive of the discontinued LLVM Phabricator instance.

Optimize unrolled reductions in LoopStrengthReduce
Needs Review (Public)

Authored by ohsallen on Jan 22 2015, 9:34 AM.

Details

Reviewers
hfinkel
Summary

Break dependencies between unrolled iterations of reductions in loops. This should be particularly effective for superscalar targets. For a kernel similar to the one below, we get a 2.5x speedup on POWER8 when the unroll factor is 3.

// Original reduction.
for (int i = 0; i < n; ++i)
    r += arr[i];

// Unrolled reduction.
for (int i = 0; i < n; i += 2) {
    r += arr[i];
    r += arr[i+1];
}

// Optimized reduction
float r_0 = 0;
for (int i = 0; i < n; i += 2) {
    r += arr[i];
    r_0 += arr[i+1];
}
r += r_0;
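
With an unroll factor of 3, which is the configuration the 2.5x number above refers to, the transformed loop would look roughly as follows (illustrative sketch only; as in the examples above, the remainder iterations when n is not a multiple of the unroll factor are omitted):

// Optimized reduction, unroll factor 3 (illustrative sketch).
float r_0 = 0, r_1 = 0;
for (int i = 0; i < n; i += 3) {
    r += arr[i];
    r_0 += arr[i+1];
    r_1 += arr[i+2];
}
r += r_0 + r_1;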

Diff Detail

Event Timeline

ohsallen updated this revision to Diff 18617. Jan 22 2015, 9:34 AM
ohsallen retitled this revision from to Optimize unrolled reductions in LoopStrengthReduce.
ohsallen updated this object.
ohsallen edited the test plan for this revision. (Show Details)
ohsallen added a reviewer: hfinkel.
ohsallen added a subscriber: Unknown Object (MLST).

Hi Olivier,

AFAIK, most, if not all, heuristics in LSR are careful not to increase register pressure. You can see that, in particular, when we rate the formulae.
On the other hand, your optimization increases the register pressure for the whole loop. This may not be a good idea in general.

Ultimately, I would like us to have some kind of register-pressure estimate to decide whether or not we should perform the transformation for a given loop.
In the short term, it should, in my opinion, at least be parameterized on the target, i.e., we should have a target hook to decide whether or not we want to perform this optimization.
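
To be concrete, something along these lines is what I have in mind (a standalone sketch only; the names are invented and this is not an existing LLVM interface):

// Sketch only, not an existing LLVM interface: a per-target knob that
// the transformation would query before introducing extra accumulators.
struct ReductionSplitLimits {
    unsigned MaxChains; // maximum number of independent reduction chains
};

// A POWER8-like target can afford several chains; a register-poor
// in-order target would report 1, i.e. "do not transform".
static struct ReductionSplitLimits getReductionSplitLimits(int IsPower8Like) {
    struct ReductionSplitLimits Limits = { IsPower8Like ? 3u : 1u };
    return Limits;
}

The point is only that the decision, and any limit on the number of chains, would come from the target rather than being hard-coded.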

Side question: how does this patch impact performance on the LLVM test-suite?

Thanks,
-Quentin

ohsallen added a comment. Edited Jan 22 2015, 11:36 AM

Quentin,

Thanks for the feedback. This makes sense and I agree. When the unroll factor is N, N-1 additional registers are live across the loop, so we could typically have some limit (depending on the target and/or register pressure, as you suggest) and apply the optimization only partially to the loop in some cases.
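
For example, with an unroll factor of 4 but a target limit of two chains, we would split the reduction into only two accumulators instead of four (illustrative sketch, not what the current patch does):

// Unroll factor 4, limited to 2 independent chains (sketch).
float r_0 = 0;
for (int i = 0; i < n; i += 4) {
    r += arr[i];
    r += arr[i+2];
    r_0 += arr[i+1];
    r_0 += arr[i+3];
}
r += r_0;

Only one additional register is live in the loop here, instead of three for a full split.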

I'll look into the performance of the test-suite.

Olivier

hfinkel edited edge metadata. Jan 22 2015, 9:40 PM

I agree, this needs a register-pressure threshold. Also, I thought that the loop vectorizer would also perform this transformation as part of its interleaved unrolling capability. Does it not? If not, perhaps it really belongs there (and the vectorizer already has register pressure heuristics)?

ohsallen added a comment. Edited Jan 26 2015, 3:07 PM

> I agree, this needs a register-pressure threshold. Also, I thought that the loop vectorizer would also perform this transformation as part of its interleaved unrolling capability. Does it not? If not, perhaps it really belongs there (and the vectorizer already has register pressure heuristics)?

Hi Hal,

The loop vectorizer does indeed perform a similar transformation, but it does not break dependencies between iterations of a loop that have already been unrolled. For instance, consider the following:

// Original loop.
for (int i = 0; i < n; i++) 
    for (int j = 0; j < 3; j++)
        r += arr[i][j];

// After unrolling pass.
for (int i = 0; i < n; i++)  {
    r += arr[i][0];
    r += arr[i][1];
    r += arr[i][2];
}

// After vectorization pass.
for (int i = 0; i < n; i += 2)  {
    r += arr[i][0];
    r_0 += arr[i+1][0];
    r += arr[i][1];
    r_0 += arr[i+1][1];
    r += arr[i][2];
    r_0 += arr[i+1][2];
}
r += r_0;

// After strength reduction pass with changes.
for (int i = 0; i < n; i += 2)  {
    r += arr[i][0];
    r_0 += arr[i+1][0];
    r_1 += arr[i][1];
    r_2 += arr[i+1][1];
    r_3 += arr[i][2];
    r_4 += arr[i+1][2];
}
r += r_0 + r_1 + r_2 + r_3 + r_4;

The interleaved unrolling in the loop vectorizer seems to add on top of the earlier unrolling pass. There are two separate dependency chains after vectorization, but the code runs faster on POWER8 with three chains (and potentially even faster with up to six chains). By breaking dependencies (while checking register pressure) later, in strength reduction, we can achieve better performance. It's not clear to me whether the loop vectorizer can be changed to get this behavior; I'll have to investigate.

Thanks,

Olivier

As explained in my last email, the regular loop unroller (LoopUnroll.cpp) does not break dependencies in reduction chains. Only the loop vectorizer/unroller (LoopVectorize.cpp) does. The problem with the latter is that the code that breaks dependencies and the code that performs unrolling are tightly coupled. So, if the loop was already unrolled by the first unrolling pass, reductions aren't optimized by the loop vectorizer/unroller.

To reuse the existing code in LoopVectorize.cpp (instead of my patch), we could choose in LoopUnroll.cpp not to unroll loops which contain reductions. Then the vectorizer would see the opportunity to unroll and perform the optimization. There would be a few cosmetic changes to make in the loop vectorizer, which I can detail if you think this is reasonable.

Olivier

> As explained in my last email, the regular loop unroller (LoopUnroll.cpp) does not break dependencies in reduction chains. Only the loop vectorizer/unroller (LoopVectorize.cpp) does. The problem with the latter is that the code that breaks dependencies and the code that performs unrolling are tightly coupled. So, if the loop was already unrolled by the first unrolling pass, reductions aren't optimized by the loop vectorizer/unroller.

I don't understand the problem you're trying to highlight. The loop unroller is run in two places within the standard optimization pipeline. The first place is 'early', within the inliner-driven CGSCC pass manager. When run early, it does *full* unrolling only. It is also run 'late', after the loop vectorizer, when it might also do target-directed partial unrolling. But this is after the loop vectorizer runs, so there should be no conflict.
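
To illustrate the distinction (sketch only): a loop with a known, small trip count like the first one below can be fully unrolled by the early pass, while the second one has an unknown trip count, so only the late, target-directed partial unrolling (or the vectorizer) applies to it:

// Known trip count of 3: eligible for *full* unrolling in the early pass.
for (int j = 0; j < 3; j++)
    r += a[j];

// Unknown trip count n: cannot be fully unrolled, so only the late,
// target-directed partial unrolling (or vectorizer interleaving) applies.
for (int i = 0; i < n; i++)
    r += b[i];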

> To reuse the existing code in LoopVectorize.cpp (instead of my patch), we could choose in LoopUnroll.cpp not to unroll loops which contain reductions. Then the vectorizer would see the opportunity to unroll and perform the optimization. There would be a few cosmetic changes to make in the loop vectorizer, which I can detail if you think this is reasonable.

Yes, I think that enhancing the loop vectorizer to do this is reasonable.


Hal,

> I don't understand the problem you're trying to highlight. The loop unroller is run in two places within the standard optimization pipeline. The first place is 'early', within the inliner-driven CGSCC pass manager. When run early, it does *full* unrolling only. It is also run 'late', after the loop vectorizer, when it might also do target-directed partial unrolling. But this is after the loop vectorizer runs, so there should be no conflict.

Thanks for the clarification. What I meant is that it seems harder to change the vectorizer to break dependencies in loops that are already unrolled than to skip unrolling them in the 'early' pass and let the vectorizer and the 'late' unroller do their job. I'll propose a patch that implements the second solution.

Olivier

Here is a simpler solution: when the inner loop contains reductions and gets unrolled, the loop vectorizer should unroll the outer loop and break dependencies. For the code below, this does not happen because the loop isn't considered 'small' anymore. Attached is a patch which changes the heuristics in the vectorizer's unroller and gives a 2x speedup for this code on POWER8. If it LGTY, I will add it as a regression test.

for(int i=0; i<n; i++) {
  for(int i_c=0; i_c<3; i_c++) {
    _Complex __attribute__ ((aligned (8))) at = a[i][i_c];
    sum += ((__real__(at))*(__real__(at)) + (__imag__(at))*(__imag__(at)));
  }
}
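
For reference, the shape I would expect the vectorizer's unroller to produce here is roughly the following (interleave factor 2; sum_0 and its float type are illustrative, the inner loop is left as-is for brevity, and the remainder iteration for odd n is omitted):

// Expected interleaved form, sketched by hand: two independent
// accumulators, so the two reduction chains no longer depend on
// each other across outer-loop iterations.
float sum_0 = 0;
for(int i=0; i<n; i+=2) {
  for(int i_c=0; i_c<3; i_c++) {
    _Complex __attribute__ ((aligned (8))) at = a[i][i_c];
    _Complex __attribute__ ((aligned (8))) at_0 = a[i+1][i_c];
    sum += ((__real__(at))*(__real__(at)) + (__imag__(at))*(__imag__(at)));
    sum_0 += ((__real__(at_0))*(__real__(at_0)) + (__imag__(at_0))*(__imag__(at_0)));
  }
}
sum += sum_0;

This is the same two-chain pattern as the 'after vectorization' example earlier in this thread.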

> Here is a simpler solution: when the inner loop contains reductions and gets unrolled, the loop vectorizer should unroll the outer loop and break dependencies. For the code below, this does not happen because the loop isn't considered 'small' anymore. Attached is a patch which changes the heuristics in the vectorizer's unroller and gives a 2x speedup for this code on POWER8. If it LGTY, I will add it as a regression test.
>
> for(int i=0; i<n; i++) {
>   for(int i_c=0; i_c<3; i_c++) {
>     _Complex __attribute__ ((aligned (8))) at = a[i][i_c];
>     sum += ((__real__(at))*(__real__(at)) + (__imag__(at))*(__imag__(at)));
>   }
> }

Okay, interesting. Please propose this as a separate patch. We'll need to run tests on other platforms too (or otherwise restrict it).

FWIW I'm not such a huge fan of using register pressure heuristics in the middle end; it has a tendency to hide code gen deficiencies, or to be suboptimal in cases where we could do better if we did the optimization and let code generation fold things together/MachineLICM/etc. I understand where the desire comes from, but if possible I like to optimize as hard as we can in the middle end and clean it up using machine optimizations.

Yes, I know this isn't always practical, but...

-eric

Eric,

Interesting point. I could eventually investigate that if I were to apply the current patch to LoopStrengthReduce. As Hal and I decided to use the existing functionality in the loop vectorizer, we will rely on the existing heuristics.

Thanks,
Olivier

> FWIW I'm not such a huge fan of using register pressure heuristics in the middle end

FWIW, I agree, but this is not what LSR currently does, and since the backends do not expect that yet, I would prefer to move cautiously.

Q.

Cautious is fine, but adding more register pressure heuristics to the middle end just gets us to a point where everything is poorly done via heuristics.

-eric

----- Original Message -----

From: "Eric Christopher" <echristo@gmail.com>
To: ohsallen@us.ibm.com, hfinkel@anl.gov
Cc: llvm-commits@cs.uiuc.edu
Sent: Tuesday, February 10, 2015 3:22:43 PM
Subject: Re: [PATCH] Optimize unrolled reductions in LoopStrengthReduce

> Cautious is fine, but adding more register pressure heuristics to the middle end just gets us to a point where everything is poorly done via heuristics.

FWIW, this is why I asked that this be moved to the vectorizer, where we already have a register-pressure heuristic, and I don't want to add any more of these than necessary. That is what is being worked on now.

-Hal


http://reviews.llvm.org/D7128
