
[LoopInterchange] Add support to interchange loops with reductions.
ClosedPublic

Authored by karthikthecool on Mar 13 2015, 2:50 AM.

Details

Summary

Hi Hal,
Please find attached the patch to enable interchange of loops having reductions. The logic to detect a reduction/induction is borrowed from loop vectorizer code.

With this change we are now able to interchange matrix multiplication code such as the one below:

for( int i=1;i<2048;i++)
  for( int j=1;j<2048;j++)
    for( int k=1;k<2048;k++) 
      A[i][j]+=B[i][k]*C[k][j];

into:

for( int k=1;k<2048;k++) 
  for( int i=1;i<2048;i++)
    for( int j=1;j<2048;j++)
      A[i][j]+=B[i][k]*C[k][j];

which now gets vectorized.

We observe a ~3X execution time improvement in the above code.
It would be great if you could let me know your inputs on the same.

Thanks and Regards
Karthik Bhat

Diff Detail

Repository
rL LLVM

Event Timeline

karthikthecool retitled this revision from to [LoopInterchange] Add support to interchange loops with reductions..
karthikthecool updated this object.
karthikthecool edited the test plan for this revision. (Show Details)
karthikthecool added reviewers: hfinkel, jmolloy.

Hi Karthik,

Wouldn't this be better integrated into the vectorizer? Or, at least, could it not duplicate code from the vectorizer?

Also, you have at least two other bold moves:

  1. Moving a lot of internal methods to static. This could be OK, but it needs some further explanation of why it is so.
  2. Moving the interchange pass down the pipeline. Again, it could be the right thing to do, but it also needs some context.

Finally, how did you test and benchmark your changes? "It makes my code 3x faster" is not enough to justify including such a big change. The steps you need to complete to get such a change in are, in order:

  1. Run the "make check-all" tests and make sure they're green
  2. Run the test-suite and make sure it passes *and* doesn't regress performance
  3. Provide more information about the benchmarks you tested it on, relative improvements, regressions, etc.

cheers,
--renato

Hi Renato,
Thanks for looking into the patch. Please find my comments inline.

Wouldn't this be better integrated into the vectorizer? Or, at least, could it not duplicate code from the vectorizer?

Since this is a complete pass, integrating it with the loop vectorizer would be a bit difficult. I can try to move the common functions into a utility file if required.

Also, you have at least two other bold moves:

  1. Moving a lot of internal methods to static. This could be OK, but it needs some further explanation of why it is so.

These methods are helper functions. Helper functions are usually marked static in other passes; I had missed this in my initial commit and hence corrected it here.

  2. Moving the interchange pass down the pipeline. Again, it could be the right thing to do, but it also needs some context.

I had to move the interchange pass down the pipeline and add the LICM and loop unswitch passes after loop interchange because, after the inner loop header/loop latches are split and the loops are interchanged, we get multiple blocks with successors outside the loop (i.e., getExitingBlock returned null). This kind of loop is not handled by the vectorizer, so I had to run LICM and loop unswitch to remove the unconditional branches between blocks in the inner loop.

Finally, how did you test and benchmark your changes?

I have run the LLVM LNT test cases but unfortunately didn't observe much improvement with this patch. I'm planning to run the Phoronix test suites to see if it gives an improvement on some known benchmark. The matrix multiplication code was from one of our internal test cases, which showed an improvement after this patch. I have added a few test cases (positive and negative) and ran make check-all to make sure there are no regressions. I have also written a few local C test cases to check that the outputs are the same after interchange.

  1. Run the "make check-all" tests and make sure they're green

Yes, make check-all passes with this patch. I have also added new test cases to test this feature.

  1. Run the test-suite and make sure it passes *and* doesn't regress performance
  2. Provide more information about the benchmarks that you tested it, relative improvements, regressions, etc.

Yes, I ran the LLVM LNT benchmarks but didn't observe much improvement. No regressions were observed either. I have observed one crash in the Dependency Analysis module which is triggered after this patch. Since this pass is currently disabled by default, I was planning to address that crash separately. Will that be OK?

I will try to get more benchmark data using the Phoronix test suites over the weekend and get back to you.

Thanks once again for looking into the patch. Looking forward to your inputs on the patch.
Regards
Karthik Bhat

karthikthecool edited the test plan for this revision. (Show Details)

Hi Hal,Renato,
I have refactored some common code into functions. I have currently borrowed and modified some functions from the loop vectorizer. Do I need to refactor them into a common utility as well? Functions such as AddReductionVar seem to be a bit tightly bound to the loop vectorizer code.

The second change is in PassManagerBuilder. Running SimplifyCFGPass after LoopInterchange is sufficient to merge and remove the redundant basic blocks (blocks with just an unconditional branch) produced after loop interchange. I have updated the code to reflect the same.

I ran a few Phoronix benchmarks and LNT benchmarks but unfortunately didn't see any improvement/regression due to this patch.

As mentioned in previous comments, post this change, code such as:

void matrixMult(int N, int M, int K) {
  for(int i=0;i<N;i++)
    for(int j=0;j<M;j++)
      for(int k=0;k<K;k++)
        A[i][j]+=B[i][k]*C[k][j];
}

gets vectorized, giving some execution-time improvement during large matrix multiplication.

It would be great if you could let me know your inputs on the same.
Also, is there a matrix multiplication benchmark which I can test to see if the kind of code I mentioned above gets triggered?

Thanks and Regards
Karthik Bhat

rengolin set the repository for this revision to rL LLVM.

I have refactored some common code into functions. I have currently borrowed and modified some functions from the loop vectorizer. Do I need to refactor them into a common utility as well? Functions such as AddReductionVar seem to be a bit tightly bound to the loop vectorizer code.

Yes, they are, and I can see what the problem is. But there is a lot of duplication added by this patch and I'm still uncomfortable. I've added Nadav and Arnold, our loop vectorizer experts, to assist on what to do next.

I strongly suggest against duplication, and the only option I can think of is to spot the pattern while creating the reduction variable. You can create a function to iterate all containing loops and inspect all the ranges to make sure they match your pattern. Early exits should be made if the loop is not deep enough, or the outer loops don't iterate through any of the affected induction variables in your reduction.

The second change is in PassManagerBuilder. Running SimplifyCFGPass after LoopInterchange is sufficient to merge and remove the redundant basic blocks (blocks with just an unconditional branch) produced after loop interchange. I have updated the code to reflect the same.

This is good news. Means that the pass is a lot less dramatic than you anticipated. :) This gives me hope that doing this inside the loop vectorizer can be managed.

I ran a few Phoronix benchmarks and LNT benchmarks but unfortunately didn't see any improvement/regression due to this patch.

I'd say "fortunately", since you haven't introduced any regressions, and that's a great thing!

As mentioned in previous comments, post this change, code such as:

void matrixMult(int N, int M, int K) {
  for(int i=0;i<N;i++)
    for(int j=0;j<M;j++)
      for(int k=0;k<K;k++)
        A[i][j]+=B[i][k]*C[k][j];
}

gets vectorized, giving some execution-time improvement during large matrix multiplication.

It seems we don't have that kind of benchmark on our test suite, and it would be good to have one. I don't know one off the top of my head, but maybe Hal/Nadav/Arnold could help.

cheers,
--renato

Thanks Renato, Tobias for your inputs. Please find my comments inline-

Yes, they are, and I can see what the problem is. But there is a lot of duplication added by this patch and I'm still uncomfortable. I've added Nadav and Arnold, our loop vectorizer experts, to assist on what to do next.

Sure. The functions currently duplicated are isReductionPHI (and its helpers) and isInductionPHI. Is it OK to move them to somewhere like LoopBase? Loop could expose an API to check whether a variable is an induction/reduction variable in the loop; it could then be reused by other modules.

I strongly suggest against duplication, and the only option I can think of is to spot the pattern while creating the reduction variable. You can create a function to iterate all containing loops and inspect all the ranges to make sure they match your pattern. Early exits should be made if the loop is not deep enough, or the outer loops don't iterate through any of the affected induction variables in your reduction.

We do have early exits in the code; they were checked in in the initial version. E.g., the minimum loop depth, dependency checks to see if interchange is safe, etc., are done before populating the reduction/induction PHIs.

This is good news. Means that the pass is a lot less dramatic than you anticipated. :) This gives me hope that doing this inside the loop vectorizer can be managed.

Yes :). But I still feel that this should be a separate pass and should not be moved inside the loop vectorizer. The reason is that loop interchange is not specifically for vectorization; based on the profitability model it can also be used for cache reuse, register reuse, etc. So having it as a separate pass looks like a good option to me. Please let me know if you feel otherwise.

It seems we don't have that kind of benchmark on our test suite, and it would be good to have one. I don't know one off the top of my head, but maybe Hal/Nadav/Arnold could help.

Thanks, Tobias, for pointing out the test case SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm.c. But the loop inside the test case is something like:

for (i = 0; i < _PB_NI; i++)
  for (j = 0; j < _PB_NJ; j++) {
    C[i][j] *= beta;
    for (k = 0; k < _PB_NK; ++k)
      C[i][j] += alpha * A[i][k] * B[k][j];
  }

Since loops j and k are not tightly nested, we currently do not interchange these loops. As you mentioned, loop interchange will give better results when it works along with other passes such as loop tiling, loop splitting, etc.

Thanks for spending your valuable time on this patch. Is it OK to go ahead with moving the duplicated code somewhere as a Loop API/function, or could you suggest a better place to reduce the code duplication?
Any other comments about the patch in general are welcome as well. :)
Thanks again for your time.
Regards
Karthik Bhat

Sure. The functions currently duplicated are isReductionPHI (and its helpers) and isInductionPHI. Is it OK to move them to somewhere like LoopBase? Loop could expose an API to check whether a variable is an induction/reduction variable in the loop; it could then be reused by other modules.

I'm not sure this would work, because isReduction/Induction in the vectorizer is very specific to the vectorizer's internal data structures, unless you manage to peel out their common functionality without messing up the vectorizer too much.

I agree that this pass is not just good for vectorizing, but it's applicable to a very specific use case whose whole point is to get vectorized. I'll see if I can get Nadav's/Arnold's attention on this review...

cheers,
--renato

Sure. The functions currently duplicated are isReductionPHI (and its helpers) and isInductionPHI. Is it OK to move them to somewhere like LoopBase? Loop could expose an API to check whether a variable is an induction/reduction variable in the loop; it could then be reused by other modules.

Karthik,

Can you try to refactor those two functions to be generic, and make both the loop vectorizer and interchange use them?

cheers,
--renato

karthikthecool removed rL LLVM as the repository for this revision.

Hi Renato,
Sorry for the delay in following up on this patch; I was stuck with some other work. I was able to refactor isInductionVar out of the LoopVectorizer, and we are using the refactored function in this patch.

Refactoring AddReductionVar, though, seems a bit tricky, as I'm not sure whether we should expose all the enums/structs that are currently used to support it. I'm currently only using a part of that code.

I have added a TODO for this (i.e., refactor isReductionPHI) for now. Is it OK to address the refactoring of isReductionPHI in the future? I wanted to address a few other pending issues in LoopInterchange first.

Please let me know if you feel otherwise; I will then try to refactor the code on priority.
Thanks a lot for your time and help; I really appreciate it.

Regards
Karthik Bhat

Add full context diff.

Hi Hal,Renato,
A gentle ping for review.

Hi Karthik,

We were all at EuroLLVM, and I'm still not back home. Sorry for the delay; I'll look at it first thing tomorrow.

Cheers,
Renato

Hi Karthik,

Copying the vectorizer's reduction detection here is not the way forward. Please, refactor the detection part into a generic function.

My initial guess would be to create a vectorizer common library in lib/Transforms/Utils and move the reduction detection in there, like the other loop utilities, and get the vectorizer and your pass to use that.

cheers,
--renato

lib/Transforms/Scalar/LoopInterchange.cpp
104 ↗(On Diff #22786)

Avoid whitespace/empty-line changes together with code changes.

373 ↗(On Diff #22786)

I don't think we should add this code here, not even with a TODO to refactor this, because this TODO will never be done.

Hi Renato,
Updated the code as per the review comments. Refactored the reduction identification code out of the loop vectorizer and am reusing it in this pass. This code assumes D9046 has been applied.

Please let me know your inputs on this.

Thanks for your continued support. I really appreciate it.

Thanks and Regards
Karthik Bhat

Hi Karthik,

Looks a lot better, thanks!

I'll let @jmolloy review this one, as he was more tuned to this issue. I have no more concerns, thank you.

cheers,
--renato

Hi Renato,
Thanks for the review and time. Updating code to reflect changes done in r235284.

Hi Hal, James,
Could you please share your valuable inputs on this patch.
Thanks and Regards
Karthik Bhat

Hi Karthik,

The patch is looking good, apart from the few comments. I'd welcome @jmolloy's comments.

cheers,
--renato

lib/Transforms/Scalar/LoopInterchange.cpp
611 ↗(On Diff #24012)

nitpick, please join this "else if".

628 ↗(On Diff #24012)

nitpick, please join this "else if".

test/Transforms/LoopInterchange/reductions.ll
3 ↗(On Diff #24012)

This is a generic test; it must not contain a target triple or it will fail on all arches except x86_64. We do want to test this on ARM, MIPS, PPC, etc., so we should remove the triple and make sure it works on all buildbots.

124 ↗(On Diff #24012)

These check lines are bound to fail on multiple architectures, and with other optimisations coming in later, they could break the sequence or reorder instructions. You need to match only what's really relevant, in the right order, with the right arguments stored in variables ("[[foo]]") and stripped of architecture-specific types, to make sure it passes on 16/32/64-bit machines.

The same is true for the other tests.

jmolloy edited edge metadata.Apr 20 2015, 9:22 AM

Hi Karthik,

See my inline comments.

Cheers,

James

lib/Transforms/Scalar/LoopInterchange.cpp
258 ↗(On Diff #24012)

s/Theorm/Theorem

590 ↗(On Diff #24012)

Can you rename this to "areAllUsesReductions"? It sounds more like a boolean query, which is what this is; "Check" implies some action.

595 ↗(On Diff #24012)

You can merge these two if's:

if (!UserIns || !ReductionDescriptor::isReductionPHI(UserIns, L, RD))
  return false;

If you wanted to be *really* cool...

return !std::any(Ins->user_begin(), Ins->user_end(), [](User *U) {
  PHINode *UserIns = dyn_cast<PHINode>(*I);
  ReductionDescriptor RD;
  return UserIns && ReductionDescriptor::isReductionPHI(UserIns, L, RD);
});
696 ↗(On Diff #24012)

just:

if (!L->getLoopLatch() || !L->getLoopPredecessor())
  return false;
707 ↗(On Diff #24012)

Probably a good idea to have some debugging output here saying which PHI failed to be recognized?

732 ↗(On Diff #24012)

Assert that this is 2?

757 ↗(On Diff #24012)

Some debugging output saying why it failed would be nice

760 ↗(On Diff #24012)

I don't like this; you're mutating the contents of the class for no good reason. It would be better to explicitly pass Inductions and Reductions (as stack-local variables) to populateInductionAndReductions(), and rename it to findInductionAndReductions().

1063 ↗(On Diff #24012)

You found this out in LoopInterchangeLegality, but threw away the result. Why recalculate it here?

1081 ↗(On Diff #24012)

s/TODO/FIXME

1093 ↗(On Diff #24012)

Just:

for (auto U : PHI->users())
1138 ↗(On Diff #24012)

Needs a bailout if I is not a PHINode.

Hi James, Renato,
Thanks for the comments. Updated the code to address review comments in LoopInterchange. Please find my comments inline.
Thanks and Regards
Karthik Bhat

lib/Transforms/Scalar/LoopInterchange.cpp
258 ↗(On Diff #24012)

Updated.

590 ↗(On Diff #24012)

Yes, the new naming seems much better. Updated the code.

595 ↗(On Diff #24012)

Wow, that was cool. :) Got to learn and use a lambda function. Updated the code. But I think it should be:

return !std::any_of(Ins->user_begin(), Ins->user_end(), [=](User *U) {
  PHINode *UserIns = dyn_cast<PHINode>(U);
  ReductionDescriptor RD;
  return !UserIns || !ReductionDescriptor::isReductionPHI(UserIns, L, RD);
});
611 ↗(On Diff #24012)

Updated code.

628 ↗(On Diff #24012)

Updated code.

696 ↗(On Diff #24012)

Updated code.

707 ↗(On Diff #24012)

Added Debugging output.

732 ↗(On Diff #24012)

Oops, this should be getNumSuccessors(). Updated the code and added an assertion.

757 ↗(On Diff #24012)

Done.

760 ↗(On Diff #24012)

Updated code.

1063 ↗(On Diff #24012)

Modified the code to reuse the value calculated by LoopInterchangeLegality in LoopInterchangeTransform.

1081 ↗(On Diff #24012)

Done.

1093 ↗(On Diff #24012)

I think this for loop was redundant. I wanted to replace all uses of the PHI with the incoming value from the header, and PHI->replaceAllUsesWith(V); suffices for that. The loop is not required, so I deleted it.

1138 ↗(On Diff #24012)

Updated the code to use "cast" instead of "dyn_cast", as I will always be a PHINode if it enters the for loop.

test/Transforms/LoopInterchange/reductions.ll
3 ↗(On Diff #24012)

Won't this only run if the triple is x86_64-unknown-linux-gnu? But I agree I need to add some general tests as well.

124 ↗(On Diff #24012)

The problem here was that I wanted to check that the loops are interchanged properly, and for that I was checking the complete structure of the interchanged loop for correctness. I will try to reduce the checks.

karthikthecool edited edge metadata.

Hi James,Renato,
Updated LoopInterchange.cpp as per review comments. Please have a look when you find time.
Thanks a lot for your time and guidance.
Thanks and Regards
Karthik Bhat

rengolin added inline comments.Apr 21 2015, 5:28 AM
test/Transforms/LoopInterchange/reductions.ll
3 ↗(On Diff #24012)

No. Forcing it to run only on one architecture means either adding REQUIRES: x86_64 or moving it into a platform-specific directory with a lit.cfg that does the same thing. But we want neither.

Since this is not a platform-specific pass, I think this optimisation should be tested on all platforms, unless we have a good reason not to. Taking away the triple will ensure that the target is picked as the host, which is what we want. The IR might have to change to accommodate other targets. I can test it on ARM/AArch64 to be sure.

124 ↗(On Diff #24012)

That's fine, but you shouldn't rely on the types and names as much as on the sequence of instructions. Only match types and variables if they *must* follow a specific pattern; if not, just check the instruction names and correct order.

karthikthecool added inline comments.Apr 21 2015, 5:38 AM
test/Transforms/LoopInterchange/reductions.ll
3 ↗(On Diff #24012)

Interesting, because I had added some test cases with a target triple in the initial check-in of this pass and they seem to pass.
Also, I saw similar tests in LoopReroll, which I took as a reference.

But I agree we need generic tests; I will add/update the same.
Wow, it would be great if you could help me with the results on ARM/AArch64. I will try out the same as well.

124 ↗(On Diff #24012)

Got your point. I will update the tests by tomorrow.
Thanks for the comments. :)

OK, this looks OK to me now.

I tested on ARM, and it works if you remove the data layout/triple from the test. Once you remove that and make the CHECK lines a bit less specific, it looks good to me too. Thanks!

Hi Renato,
Updated the test cases as per the comments. Please let me know if this looks good to you.
Verified with a Debug+Asserts build and make check-all on x86_64.

Thanks and Regards
Karthik Bhat

rengolin accepted this revision.Apr 22 2015, 9:53 AM
rengolin added a reviewer: rengolin.

Hi Karthik,

Looks good to me with the two nitpicks. Feel free to commit with those changes.

Thanks!
-renato

test/Transforms/LoopInterchange/reductions.ll
49 ↗(On Diff #24198)

nitpick: you don't need to match %0 here. It won't match if some pass adds a new unrelated operation. Just match up to promoted.

113 ↗(On Diff #24198)

Same here, mainly because %1 and %0 were not part of the rest of the match.

This revision is now accepted and ready to land.Apr 22 2015, 9:53 AM
This revision was automatically updated to reflect the committed changes.