Page MenuHomePhabricator

Loop Rerolling Pass
ClosedPublic

Authored by hfinkel on Oct 15 2013, 12:09 PM.

Details

Summary

I've created a loop rerolling pass. The transformation aims to take loops like this:

for (int i = 0; i < 3200; i += 5) {
  a[i] += alpha * b[i];
  a[i + 1] += alpha * b[i + 1];
  a[i + 2] += alpha * b[i + 2];
  a[i + 3] += alpha * b[i + 3];
  a[i + 4] += alpha * b[i + 4];
}

and turn them into this:

for (int i = 0; i < 3200; ++i) {
  a[i] += alpha * b[i];
}

and loops like this:

for (int i = 0; i < 500; ++i) {
  x[3*i] = foo(0);
  x[3*i+1] = foo(0);
  x[3*i+2] = foo(0);
}

and turn them into this:

for (int i = 0; i < 1500; ++i) {
  x[i] = foo(0);
}

There are two motivations for this transformation:

  1. Code-size reduction (especially relevant, obviously, when compiling for code size).
  2. Providing greater choice to the loop vectorizer (and generic unroller) to choose the unrolling factor (and a better ability to vectorize). The loop vectorizer can take vector lengths and register pressure into account when choosing an unrolling factor, for example, and a pre-unrolled loop limits that choice. This is especially problematic if the manual unrolling was optimized for a machine different from the current target.

The current implementation is limited to single basic-block loops only. The rerolling recognition should work regardless of how the loop iterations are intermixed within the loop body (subject to dependency and side-effect constraints), but the significant restriction is that the order of the instructions in each iteration must be identical. This seems sufficient to capture all of my current use cases.

The transformation triggers very rarely on the test suite (which I think it good, programmers should be able to leave trivial unrolling to the compiler). When I insert this pass just prior to loop vectorization, and prior to SLP vectorization (so that we prefer to reroll over SLP vectorizing), it helps:

On an Intel Xeon E5430:
MultiSource/Benchmarks/TSVC/LoopRerolling-flt: 36% speedup (loops s351 and s353 are rerolled, s353's performance regresses by 9%, but s351 exhibits a 76% speedup; all others are unchanged)
MultiSource/Benchmarks/TSVC/LoopRerolling-dbl: 13% speedup (loops s351 and s353 are rerolled, s353's performance is essentially unchanged, but s351 exhibits a 38% speedup; all others are unchanged)
FreeBench/distray/distray: No significant change

Please review.

Thanks again,
Hal

Diff Detail

Event Timeline

nadav added a comment.Oct 15 2013, 1:18 PM

Hi Hal,

Thanks for working on this. The motivation for loop-rolling is clear and I agree that it has the potential for reducing code size and increasing performance. Have you considered using the SLP-vectorizer to detect consecutive/parallel statements ? You could construct the SLP tree and use this information for the rolling transformation. The second question I had was how many times is this triggered in the test suite ?

Thanks,
Nadav

hfinkel updated this revision to Unknown Object (????).Oct 17 2013, 7:54 AM

Here's an updated version which supports reductions. This means that the pass now handles all three rerolling tests from the TSVC benchmark.

This code certainly needs some refactoring, and I'm working on that, but I think that the implementation is functionally complete at this point (at least in terms of handling the single basic-block case). If you all agree, I can do this refactoring work in trunk.

hfinkel updated this revision to Unknown Object (????).Oct 22 2013, 11:25 AM

Here's an updated patch. I don't think that there are any functional changes, but I've done some refactoring work.

Why are you looking for Phi-IVs across all loop blocks instead of just the header block in collectPossibleIVs and collectPossibleReductions?

Why is collectPossibleReductions a separate pass from collectPossibleIVs?

I'm not sure what IVInfo is for. It looks like the SCEV field is always the result of calling getSCEV on the first field.

I'm not sure why findScaleFromMul needs to look for Mul instructions instead of examining SCEVs.

I don't know what "Final" is for.

Eventually I had to punt on the review. I wasn't able to grok everything that is going on in LoopReroll::reroll. The function is > 400 lines, with lots of state. Since I don't really understand the algorithm, it's hard for me to suggest a better way of doing it. If the results are good and compilation time isn't terrible, I can't argue against checking it in as a prototype. It needs a lot of improvement though before I would consider it reviewable/maintainable.

lib/Transforms/Scalar/LoopRerollPass.cpp
216

Did you mean PHISCEV->getStepRecurrence()

  • Original Message -----
Why are you looking for Phi-IVs across all loop blocks instead of
just the header block in collectPossibleIVs and
collectPossibleReductions?

Why is collectPossibleReductions a separate pass from
collectPossibleIVs?

These could be merged. I split them in order to make the code easier to read, and so that we did not bother analyzing possible reductions if no suitable IVs for rerolling are available.

I'm not sure what IVInfo is for. It looks like the SCEV field is
always the result of calling getSCEV on the first field.

I had thought that it would be better to cache the result so I would not need to call getSCEV again. As you imply, this is probably not worthwhile.

I'm not sure why findScaleFromMul needs to look for Mul
instructions instead of examining SCEVs.

I probably should look at the SCEVs (because the Mul might be a shift, etc.).

I don't know what "Final" is for.

When collecting the set of users, there are three inputs:

  • The root instruction
  • The set of excluded instructions (which are never added to the use set even if they are users)
  • The set of final instructions (which are added to the use set if they are users, but on which we don't recurse looking for more users)
Eventually I had to punt on the review. I wasn't able to grok
everything that is going on in LoopReroll::reroll. The function is
> 400 lines, with lots of state. Since I don't really understand
the algorithm, it's hard for me to suggest a better way of doing
it. If the results are good and compilation time isn't terrible, I
can't argue against checking it in as a prototype. It needs a lot
of improvement though before I would consider it
reviewable/maintainable.

I don't disagree with you. It still looks like something I hacked together at midnight, and I'm still actively working on refactoring, adding more comments, etc. The question is just whether I can do this in-tree or whether more is needed before an initial checkin.

Comment at: lib/Transforms/Scalar/LoopRerollPass.cpp:215
@@ +214,3 @@
+ if (const SCEVConstant *IncSCEV =
+ dyn_cast<SCEVConstant>(PHISCEV->getOperand(1))) {

+ if (!IncSCEV->getValue()->getValue().isStrictlyPositive())

Did you mean PHISCEV->getStepRecurrence()

Yes.

Thanks again,
Hal

http://llvm-reviews.chandlerc.com/D1940

  • Original Message -----
  • Original Message -----
Why are you looking for Phi-IVs across all loop blocks instead of
just the header block in collectPossibleIVs and
collectPossibleReductions?

I had some thought that, in general, the PHIs could be in other blocks in the loop, not just the header. However, unless we had some chain of BBs linked with unconditional branches, this likely can't be the case (and still have the PHI be useful). I'll just search the header.

-Hal

Why is collectPossibleReductions a separate pass from
collectPossibleIVs?

These could be merged. I split them in order to make the code easier
to read, and so that we did not bother analyzing possible reductions
if no suitable IVs for rerolling are available.

I'm not sure what IVInfo is for. It looks like the SCEV field is
always the result of calling getSCEV on the first field.

I had thought that it would be better to cache the result so I would
not need to call getSCEV again. As you imply, this is probably not
worthwhile.

I'm not sure why findScaleFromMul needs to look for Mul
instructions instead of examining SCEVs.

I probably should look at the SCEVs (because the Mul might be a
shift, etc.).

I don't know what "Final" is for.

When collecting the set of users, there are three inputs:

  • The root instruction
  • The set of excluded instructions (which are never added to the use set even if they are users)
  • The set of final instructions (which are added to the use set if they are users, but on which we don't recurse looking for more users)
Eventually I had to punt on the review. I wasn't able to grok
everything that is going on in LoopReroll::reroll. The function
is
> 400 lines, with lots of state. Since I don't really understand
the algorithm, it's hard for me to suggest a better way of doing
it. If the results are good and compilation time isn't terrible,
I
can't argue against checking it in as a prototype. It needs a lot
of improvement though before I would consider it
reviewable/maintainable.

I don't disagree with you. It still looks like something I hacked
together at midnight, and I'm still actively working on refactoring,
adding more comments, etc. The question is just whether I can do
this in-tree or whether more is needed before an initial checkin.

Comment at: lib/Transforms/Scalar/LoopRerollPass.cpp:215
@@ +214,3 @@
+ if (const SCEVConstant *IncSCEV =
+ dyn_cast<SCEVConstant>(PHISCEV->getOperand(1))) {
+ if

(!IncSCEV->getValue()->getValue().isStrictlyPositive())

Did you mean PHISCEV->getStepRecurrence()

Yes.

Thanks again,
Hal

http://llvm-reviews.chandlerc.com/D1940

Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory


llvm-commits mailing list
llvm-commits@cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits

  • Original Message -----
  • Original Message -----
Why are you looking for Phi-IVs across all loop blocks instead of
just the header block in collectPossibleIVs and
collectPossibleReductions?

Why is collectPossibleReductions a separate pass from
collectPossibleIVs?

These could be merged. I split them in order to make the code easier
to read, and so that we did not bother analyzing possible reductions
if no suitable IVs for rerolling are available.

I'm not sure what IVInfo is for. It looks like the SCEV field is
always the result of calling getSCEV on the first field.

I had thought that it would be better to cache the result so I would
not need to call getSCEV again. As you imply, this is probably not
worthwhile.

I now recall that it was to prevent re-casting to SCEVAddRecExpr multiple times (better type safety).

-Hal

I'm not sure why findScaleFromMul needs to look for Mul
instructions instead of examining SCEVs.

I probably should look at the SCEVs (because the Mul might be a
shift, etc.).

I don't know what "Final" is for.

When collecting the set of users, there are three inputs:

  • The root instruction
  • The set of excluded instructions (which are never added to the use set even if they are users)
  • The set of final instructions (which are added to the use set if they are users, but on which we don't recurse looking for more users)
Eventually I had to punt on the review. I wasn't able to grok
everything that is going on in LoopReroll::reroll. The function
is
> 400 lines, with lots of state. Since I don't really understand
the algorithm, it's hard for me to suggest a better way of doing
it. If the results are good and compilation time isn't terrible,
I
can't argue against checking it in as a prototype. It needs a lot
of improvement though before I would consider it
reviewable/maintainable.

I don't disagree with you. It still looks like something I hacked
together at midnight, and I'm still actively working on refactoring,
adding more comments, etc. The question is just whether I can do
this in-tree or whether more is needed before an initial checkin.

Comment at: lib/Transforms/Scalar/LoopRerollPass.cpp:215
@@ +214,3 @@
+ if (const SCEVConstant *IncSCEV =
+ dyn_cast<SCEVConstant>(PHISCEV->getOperand(1))) {
+ if

(!IncSCEV->getValue()->getValue().isStrictlyPositive())

Did you mean PHISCEV->getStepRecurrence()

Yes.

Thanks again,
Hal

http://llvm-reviews.chandlerc.com/D1940

Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory


llvm-commits mailing list
llvm-commits@cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits

hfinkel updated this revision to Unknown Object (????).Oct 28 2013, 5:19 PM

Address a couple of Andy's review suggestions, and add more comments.

There is more to come; but this should make things slightly clearer in the mean time.

hfinkel updated this revision to Unknown Object (????).Oct 29 2013, 4:39 PM

This should address the remainder of the initial review comments. I've factored out the reduction tracking into its own class, and made a number of other improvements. I've added more function documentation, including documentation on the overall algorithm. Hopefully, this is more reviewable now.

To be clear, the pass-manager change is only there to ease testing, and would not be committed in this way.

Thanks again!

rengolin accepted this revision.Nov 17 2013, 2:46 AM

Committed in r194939

rengolin closed this revision.Nov 17 2013, 2:47 AM