Page MenuHomePhabricator

[Patch] Loop Interchange Pass
ClosedPublic

Authored by karthikthecool on Feb 9 2015, 5:24 AM.

Details

Summary

Hi All,
Please find attached the patch for Loop Interchange Pass for llvm. Initial RFC and design was submitted at http://reviews.llvm.org/D7432 .
This pass is disabled by default.

To give a brief intorduction it consists of 3 stages-

  1. LoopInterchangeLegality : Checks the legality of loop interchange based on distance/direction vector.
  2. LoopInterchangeProfitability: A very basic heuristic has been added to check for profitibility. This will evolve over time.
  3. LoopInterchangeTransform : Which does the actual transform.

Current Limitation:

  1. Only handles leve 2 loops for now. Will extend it going forward to support any level of loops as James had suggested during RFC.
  2. Triangular loops are not yet supported.

As Hal had suggested during RFC i went through TSVC Benchmark. Unfortunetly i didnt get time to run it but i went through the test case for loop interchange. One of the test cases s231() which was not being vectorized previously now gets vectorized. Added a similar test case in this patch.

This patch seems to be working fine and producing correct result (i.e. interchanging doesn't change the o/p of the program) to best of my knowledge.

Wanted some comments on how to go about writing test cases for this transform? Please let me know your inputs of this.
Also is it ok to do further development on trunk once this patch is finalized?

Thanks and Regards
Karthik Bhat

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
karthikthecool set the repository for this revision to rL LLVM.
karthikthecool added a subscriber: Unknown Object (MLST).
hfinkel edited edge metadata.Feb 10 2015, 2:38 PM

Please adjust all of the variable names to start with a capital letter, see: http://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly

As Hal had suggested during RFC i went through TSVC Benchmark. Unfortunetly i didnt get time to run it but i went through the test case for loop interchange. One of the test cases s231() which was not being vectorized previously now gets vectorized. Added a similar test case in this patch.

Great, thanks!

Also is it ok to do further development on trunk once this patch is finalized?

Yes, once this functionality is finalized, we'll move further development to trunk.

include/llvm/Transforms/Scalar.h
143

How about, "This pass interchanges loops to provide a more cache-friendly memory access patterns."

lib/Transforms/Scalar/LoopInterchange.cpp
50

Please add a comment explaining what this function computes?

55

Did you mean += ?

62

LoopInfo already has a getLoopDepth() function. Can you use that?

76

This is not actually what you want. If the loop is branched to by, for example, multiple entries of the switch statement, the predecessor can be listed multiple times in the predecessor list (and, thus, you'll have more than two incoming values even though you have only 2 predecessor blocks).

I suspect that what you actually want is that there is a unique latch and a unique predecessor, so you want that L->getLoopLatch() && L->getLoopPredecessor() [neither are nullptr].

82

Do you also need to check that AddRec->isAffine()?

87

Let's say:

// FIXME: Handle loops with more than one induction variable. Note that, currently, legality makes sure we have only one induction variable.
213

I'd move this FIXME comment somewhere else, it is not particularly useful here. It is more useful to tag places that assume only two levels.

262

Either you should handle the case where the dyn_cast fails, or if it can't fail (because we've already verified that this must be a BranchInst), then use cast<> instead.

This same comment applies to many places below as well. Only use dyn_cast if the cast can fail (in which case you should handle the nullptr case). Otherwise, use cast<>.

274

Space before 'Any'

276

licm -> LICM

298

What are you actually trying to check here? Instructions with side effects? Maybe you want I->mayHaveSideEffects()?

310

These checks look identical to those above, please make a function (a lambda function is fine).

364

Can you include *Src and *Des in these debug messages so that we can see what instructions are relevant?

416

How are you checking for reductions here? Do you need to check that the one PHI you've found is not used outside of the loop?

447

Why are you only counting uses in the latch block? Should the increment be in some other block, then what?

460

Let's say, "Inner or outer loops lack a preheader"?

Also, for the future, adding a preheader when one is not present is pretty easy (you just need to call InsertPreheaderForLoop from llvm/Transforms/Utils/LoopUtils.h), we this is a limitation that should be removed sooner rather than later (although after the initial commit is okay).

501
if (GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(UseInstr)) {
  ...
}
503

What happens if it is not the IV directly, but some expression of the IV? I think you'd be better off using ScalarEvolution here, get the AddRec of the GEP, and see if the "outer" AddRec is provided in terms of the SCEV of the IV (or something like that).

551

Use SplitBlock from include/llvm/Transforms/Utils/BasicBlockUtils.h?

(same for other functions below)?

726

Use llvm_unreachable, not assert(0 &&

788

Why?

818

PHIs are always at the beginning of the block; once you hit the first non-PHI, you can exit the loop (you should never find another).

Hi Hal,
Thanks for the review. Please find my comments updated below. Will upload modified patch shortly.

P.S. Sorry for the long comments,
Thanks and Regards
Karthik Bhat

include/llvm/Transforms/Scalar.h
143

Yes of course your comment makes more sense.

lib/Transforms/Scalar/LoopInterchange.cpp
50

This function get the maximum nesting level of the innermost loop. We use this to push loops of depth 2 to worklist.
For e.g.

for(int i=0;i<N;i++)
  for(int j=0j<M;j++)
    for(int k=0;k<K,k++)

here we want to return 3 as the max nesting level is 3. I have renamed the function and added comment also modified this function a bit to correctly return the max loop depth in case we have multiple inner loops. For e.g.

for(int i=0;i<N;i++) {

    for(int j=0j<M;j++) {
       for(int k=0;k<K;k++) {
         // this loop has depth 3
      }
    }

   for(intk=0;k<K;k++) {
   // this loop has depth 2
   }
}

In the above case we still return 3 as it is the max depth.

62

getLoopDepth currently only returns the nesting level of the current loop. Since we have access to outer loop here we always get nesting level as 1. So had to go with the recursive function above.

76

Updated the code. Thanks for clarifying the problem.

82

Yes i fell we need to check isAffine as well. Thanks updated the code.

87

OK. Done.

213

OK..

262

Updated code to use cast<> wherever possible. Added null checks in places were dyn_cast is being used.

274

Modified comment.

276

Modified comment.

298

Hi Hal,
The way i'm trying to conclude that a loop is tightly nested is as follows-

  1. There should not be any extra block between the outer loop and inner loop. (i.e. in this case the outer loop header would branch to inner loop preheader/inner loop body && the other branch in the header would go to the outer loop latch).

With this check we can catch loops which have a block inbetween outer and inner loop such as -

for(int i=0;i<N;i++) {
  if(X) { }
  for(int j=0;j<N;j++) {
 }
}

and conclude these as not tightly nested.

  1. Second type of non nested loops can be-

    for(int i=0;i<N;i++) { a = i; k = A[i]; for(int j=0;j<N;j++) { } }

these kind of loops will be caught by the second check which check we have a single use of indvar in latch or header which is the operand to Induction Phi(i.e used to increment/decrement the loop counter). I have modified this function a bit in the updated patch.

In the 3rd case i was trying to catch loops such as -

for(int i=0;i<N;i++) {
  foo();
  for(int j=0;j<N;j++) {
  }
}

I think we can do it using a combination of mayHaveSideEffects and mayReadFromMemory. Updated the patch.

310

Done.

364

Done.

416

We are currently checking if there is only 1 PHI node in the lop header which will corrospond to the induction variable. If we find any other PHI's either due to reductions or triangular loop structure. We currently exit as current limiation.

447

We count the uses in loop latch only as we split the latch based on this instruction. This was done because the mentioned example generated code as -

for.body3:                                        ; preds = %for.body3, %for.body3.lr.ph
%j.018 = phi i32 [ 0, %for.body3.lr.ph ], [ %add6, %for.body3 ]
%arrayidx4 = getelementptr inbounds [100 x [100 x i32]]* @A, i32 0, i32 %j.018, i32 %i.020
%5 = load i32* %arrayidx4, align 4, !tbaa !1
%add = add nsw i32 %3, %5
%add6 = add nuw nsw i32 %j.018, 1
%arrayidx8 = getelementptr inbounds [100 x [100 x i32]]* @A, i32 0, i32 %add6, i32 %add5
store i32 %add, i32* %arrayidx8, align 4, !tbaa !1
%exitcond = icmp eq i32 %j.018, %4
br i1 %exitcond, label %for.inc9.loopexit, label %for.body3

since we cannot split at

%add6 = add nuw nsw i32 %j.018, 1

we give up in this case. But now that i think about it counting uses may not be the right method to check if we can split the inner loop latch. Consider the following valid loop were we fail with this check-

for(int i=0;i<100;i++)
for(int j=0;j<100;j++)
   A[j][i] = A[j][i]+k;

here we get the inner loop latch as -

for.body3:                                        ; preds = %for.body3, %for.cond1.preheader
%j.015 = phi i32 [ 0, %for.cond1.preheader ], [ %inc, %for.body3 ]
%arrayidx4 = getelementptr inbounds [100 x [100 x i32]]* @A, i32 0, i32 %j.015, i32 %i.016
%1 = load i32* %arrayidx4, align 4, !tbaa !1
%add = add nsw i32 %0, %1
store i32 %add, i32* %arrayidx4, align 4, !tbaa !1
%inc = add nuw nsw i32 %j.015, 1
%exitcond = icmp eq i32 %inc, 100
br i1 %exitcond, label %for.inc7, label %for.body3

This could have been splitted at

%inc = add nuw nsw i32 %j.015, 1

but we fail as we find more than 1 uses.
Modified the logic to check tightly grouped inner loop latch which can be splitted.

460

Updated code to add a preheader when not present.

501

Done.

503

Updated code to use SCEV to get the loop from which we get the operand to decide it is a good or bad load.
Able to handle code like-

for(i=0;i<N;i+=1)
  for(j=0;j<N;j++)
      A[j-1][i-1] = A[j-1][i-1]+C[j-1][i-1];

after change. This now gets vectorized after interchange.

551

Updated code.

726

Done.

818

Yes you are right. Modified code.

karthikthecool edited edge metadata.
karthikthecool removed rL LLVM as the repository for this revision.

Hi Hal,
Thanks a lot for the review. Updated the patch to address review comments. Also fixed a few issues which I found during testing.
Major changes include-

  1. Logic to calculate profitibility has been made more acurate.
  2. Logic to detect were to split the inner loop is changed to be more acurate.
  3. Added test case to check the updated profitibility model.

Please let me know your inputs on this.

This still needs major work to support generic loop depths and improved profitability model.
Hopefully will be able to complete it with help from the community.

Thanks once again for the support.

Regards
Karthik Bhat

hfinkel added inline comments.Feb 12 2015, 10:42 PM
lib/Transforms/Scalar/LoopInterchange.cpp
416

No, I mean uses outside of the loops in general. I don't think you check for that. You check for:

if (numUsageinLatch + numUsageinHeader != 1)
  return false;

but the PHI could be used in any block dominated by the loop. Do you need to check for that?

int i, j;
for (int i = 0; i < n; ++i)
  for (int j = 0; j < m; ++j)
    a[i][j] = 7;

cout << "final i, j = " << i << ",  " << j << "\n";
karthikthecool added inline comments.Feb 13 2015, 1:59 AM
lib/Transforms/Scalar/LoopInterchange.cpp
416

Hi Hal,
The way i was handling this was-
since the loops were tightly coupled we were getting the lcssa phi for these loops in the outer loop latch which i was splitting and moving outside loop. I was able to get the correct value for i and j in this case.

But i think i can add check to avoid these cases as well. Since we do not want uses outside loop is it ok to have a check like-

if (isa<PHINode>(InnerLoopLatch->begin()))
  return false;
if (isa<PHINode>(OuterLoopLatch->begin()))
  return false;

This will make sure we do not have any outside uses defined inside the loop. Does this check look good. Will this make this transform too restrictive?
Thanks for answering my silly queries i'm still getting hold of loop optimizations.
Regards
Karthik Bhat

hfinkel added inline comments.Feb 13 2015, 2:27 AM
lib/Transforms/Scalar/LoopInterchange.cpp
416

Ah, you're right. I think that, given our current restrictions, the final values outside the loop nest will always be the same, so this is fine. (we should have a regression test showing that we still can interchange in this case).

Hi Hal,
Updated the test case to add a test case to cover case were we have a usage of PHI outside the loop. I have added gi,gj as global vaiable and used them as induction variables in the loop to simulate this case.

It would be great if you could give me some inputs on writing test case for this pass.
Currently the test cases i have added are all more or less similar(i.e. they get vectorized after interchange).
For checking loop that are just interchanged but not vectorized do we have to check the exact instructions after interchange or may be check the PHI instruction order in .ll (after interchange the Induction PHI will be in the reverse order) ?

Thanks
Karthik Bhat

Hi Hal,
Updated the test case to add a test case to cover case were we have a usage of PHI outside the loop. I have added gi,gj as global vaiable and used them as induction variables in the loop to simulate this case.

It would be great if you could give me some inputs on writing test case for this pass.
Currently the test cases i have added are all more or less similar(i.e. they get vectorized after interchange).
For checking loop that are just interchanged but not vectorized do we have to check the exact instructions after interchange or may be check the PHI instruction order in .ll (after interchange the Induction PHI will be in the reverse order) ?

Good point; I'd not looked carefully at the tests yet. We should not test this pass by using the vectorizer, but rather, should test the output of the interchange pass directly.

We don't need to make this harder than necessary, but, I think that for a few representative cases, we should check all of the relevant parts of the output. Then for cases that are structurally similar to those, checking the PHI order (or some other signature of the interchange) is fine.

We also should have negative tests (some loops that aren't quite tightly nested, maybe with some function call or extra memory access, etc.) and make sure they're not interchanged. We should also add tests for current limitations (like that loops with reductions are not interchanged), and put in some FIXME comments stating that these are just current limitations.

Feel free to borrow IR from the files in test/Analysis/DependenceAnalysis and adapt them as tests here.

Thanks
Karthik Bhat

Hi Hal,
Sorry for the delay in followup. I was on a vaction.
Please find the updated and rebased patch. Added test cases as per your suggestion. I also verified the o/p of the programs on randomly generated array and o/p's are same before and after interchange in cases were loops are interchanged.

I will start to work on generic version of loop interchange (i.e. to support loops of any depth) after the initial version is committed.

Please let me know your inputs on this. Thanks again for your time and review. I really appreciate your help.

Regrads
Karthik Bhat

hfinkel added inline comments.Feb 19 2015, 6:46 PM
test/Transforms/LoopInterchange/vectorize.ll
1 ↗(On Diff #20150)

Don't run the vectorizer here. Just run interchange, verify that it does what it should, and if you want end-to-end coverage through the vectorizer, add a vectorizer regression test for the interchanged loops (likely, in the subfolder of the vectorizer's regression tests for your target of interest).

Hi Hal,
Thanks for the review.
Please find the updated patch. Moved tests from vectorize.ll to profitability.ll and interchange.ll. Checking for loop interchange as per comments. Also added a negative test case to check profitabilitymodel.(i.e. were it is legal but not profitable to interchange).

Please let me know if this looks good for initial commit. This pass is currently disabled by default.

Thanks and Regards
Karthik Bhat

Hi Hal,
I had some time to work on generic version of loop interchange to support any depth. This updated patch supports loops of any depth.
The loop selection algorithm currently selects the innermost loop for interchange. Going forward we can improve this heuristic to select the most profitable loop based on Dependency matrix.
To keep it simple in the first version loops with LCSSA phi are currently not handled. I will work on handling them in later iterations.
The legality and profitability logic is pretty much the same. We use dependency matrix to conclude legality of interchange of 2 loops.

One of the TSVC benchmark test case (s231) gives 2X improvement with this patch.

I ran llvm lnt performance tests based on http://llvm.org/docs/lnt/quickstart.html#running-tests with sample size of 3 but every time i see a lot of variations in the results. I will try to run lnt with larger sample size and update the results here.

It would be great if you could let me know your inputs on this patch.

P.S. Are there any build bots which we can use to run llvm lnt/performance tests for this patch?

Thanks and Regards
Karthik Bhat

I had some time to work on generic version of loop interchange to support any depth. This updated patch supports loops of any depth.

Nice! A few comments...

include/llvm/Transforms/Scalar.h
144

You've un-improved this comment; please change it back.

lib/Transforms/IPO/PassManagerBuilder.cpp
276

I'd certainly like to have this on by default eventually, but we should be more conservative at first. Please add a command-line flag to enable this (there are several in this file already), so we can do further testing.

lib/Transforms/Scalar/LoopInterchange.cpp
12

You should use the 'cache-friendly memory access pattern' terminology here too.

51

Is this matrix generally sparse? (or could we make it sparse by picking some default). If so, is this the right data structure?

75

I'm somewhat worried about doing this eagerly for all loops; what if they're really large with lots of memory accesses? Maybe we should have a cutoff?

653

No need for { } here.

856

I don't really understand this comment. I think we can assume that LICM has run first. (and if this pass detects loop-invariant code better than LICM, that is another problem to fix, but not here).

917

No need for the { }

Hi Hal,
Please find my comments inline. Updated the patch as per review comments and fixed few issues found during llvm lnt regression.
The current version of loop interchange gives some 30% improvement in execution time in 2 benchmarks. This is because it contains code fragments like -

for (i = 0; i < _PB_N; i++)
 for (j = 0; j < _PB_N; j++)
   x[i] = x[i] + beta * A[j][i] * y[j];

which gets benefited after interchange.
There are few compile time regression which can be because of the heavy legality checks in loop interchange. I will try to fix this in next iteration.
Apart from this we found a crash in Dependency Analysis module which I'm planning to fix seperatly as i need to understand it in more detail. Will raise a bug on the same.

Thanks and Regards
Karthik Bhat

lib/Transforms/IPO/PassManagerBuilder.cpp
276

Sure. I had added it for testing performance forgot to revert before checkin. Will add a command line flag and disable it by default.

lib/Transforms/Scalar/LoopInterchange.cpp
12

Done.

51

Yes this matrix can be sparse depending on the dependence carried by the loop. I will check more on this front. Have added a TODO for now.

75

Makes sense. Added a limit of max 10 loops(Columns in the dependency matrix) and 100 dependencies(Rows of the dependency matrix).

653

Done.

856

Hi Hal,
Consider the below code of matrix multiplication-

for(int i=0;i<N;i++)
  for(int j=0;j<N;j++)
    for(int k=0;k<N;k++)
      A[i][j]= A[i][j]+B[i][k]*C[k][j]

In this example the direction vector would be -
[= = |<] (i.e. '=' dependency in i, '=' dependency in j and is loop independent dependency in k). The LICM pass would move getElementPointer for A[i][j] outside the inner loop but it cannot move the complete statement outside the inner loop.
Now since vectorizer only works on inner loop. The above code is not vectorized for i,j. But if we interchange the loops to -

for(int k=0;k<N;k++)
  for(int i=0;i<N;i++)
    for(int j=0;j<N;j++)
      A[i][j]= A[i][j]+B[i][k]*C[k][j]

now the loop gets vectorized.
It is mostly profitable to keep loop independent dependencies such as the above at the outermost possible level. We try to achieve the same here.

917

Done.

Hi Hal,
Please find the updated patch attached. Addressed review comments and fixed few issues in Loop Interchange found during llvm lnt regression. After this change we find some improvement in execution time of 2 benchamrk test cases as shown in the previous post. There are few issues in Dependency analysis as you mentioned I'm planning to address it seperatly after looking into the module in more detail. Hope that should be fine?
Please let me know your inputs on the patch.
Thanks for your time and help.
Regards
Karthik Bhat

karthikthecool set the repository for this revision to rL LLVM.

Hi All,
Rebase to trunc and update the test cases to reflect recent changes in IR format.

Patch to fix the crash in Dependency analysis mentioned above submitted at D8059. With this patch we do not see any failures in llvm lnt. As mentioned in previous comments we see execution time improvement in 2 tests and compile time regression in few test cases.

Please if you could let me know if this is good for initial checkin with pass disabled by default. We still have some work to do in this pass e.g.-

  1. Add support for reductions and lcssa phi.
  2. Improve profitability model.
  3. Improve loop selection algorithm.
  4. Improve compile time regression found in llvm lnt due to this pass.
  5. Fix issues in Dependency Analysis module.

I would like to address them one by one on trunc if everyone is OK with it.
Awaiting response.
Regards
Karthik Bhat

hfinkel accepted this revision.Mar 5 2015, 7:23 PM
hfinkel edited edge metadata.

Thanks for continuing to work on this. I have a few minor comments below, but we can move this in-tree. Please go ahead an commit, and we'll continue to iterate/test. When you commit, please commit the change to lib/Analysis/DependenceAnalysis.cpp separately.

lib/Transforms/Scalar/LoopInterchange.cpp
515

ENABLE_DEBUGGING is too generic for this. How about calling this:

DUMP_DEP_MATRICIES
547

Please make this TODO more specific. What happens now and what should happen instead?

688

Please make this more specific. We do handle anti deps. What needs to happen?

This revision is now accepted and ready to land.Mar 5 2015, 7:23 PM
karthikthecool closed this revision.Mar 6 2015, 2:24 AM

Thanks Hal. Committed as r231458 after implementing review comments. Will raise a review for DependencyAnalysis fix shortly.
Thanks and Regards
Karthik Bhat