This is an archive of the discontinued LLVM Phabricator instance.

[RFC] Loop Interchane Pass
AbandonedPublic

Authored by karthikthecool on Feb 5 2015, 5:25 AM.

Details

Reviewers
None
Summary

Hi All,
I have been working on a loop interchange pass for llvm. The motivation is to improve the cache hit to improve performance.
Would like to get your inputs on the same.

Currently this pass only handles loop of depth 2 other loops are ignored. Goining forward we would like to fix this.
This opt is disabledby default.

LoopInterchange Pass is divided into 3 parts-

  1. LoopInterchangeLegality
  2. LoopInterchangeProfitability
  3. LoopInterchangeTransform

LoopInterchangeLegality:
This class checks all the memory instructions in the loop and uses DependenceAnalysis to conclude if we can interchange the loop or not.

LoopInterchangeLegality Functions:

canInterchangeLoops - Checks if the loops can be interchanged.
checkDependence - Called by canInterchangeLoops. Does the actual DependenceAnalysis to conclude if we can interchange the loops
currentLimiations - This function marks loops are illegal due to current limitation in the way the transform is written. I intend to fix these issues.

LoopInterchangeProfitability:
This class checks if it is profitable to interchange the loops. Currently i use only 1 heuristic which is the order in which the array elements are accessed (i.e. row major/column major) and count the good and bad order. If we have bad order more than good order we interchange. We can improve the heuristics here later on.

LoopInterchangeProfitability Functions:

isProfitable - Concludes if it is profitable to interchange.
getInstrOrderCost - Calculates the array access order heuristics.

LoopInterchangeTransform:
This transforms the loop and interchanges the inner loop with the outer loop. I'm writing a loop optimization for the first time and have few doubts here.
The way we have interchanged the loop is -

  1. Split the inner loop header and move all phi nodes into a seperate block. This will be the new outer header.
  2. Split the loop latch at the indvar.next instruction( This may need improvement as indicated in a TODO in the code) and this will be the new outer loop latch.
  3. Adjust the loop links by adjusting the branch instructions so that we move the inner loop outside.
  4. Fix PHi nodes due to the step 3 as previous basicblock from which it branched will have changed.

After this step the loop is interchanged. It gives the correct results for few of the sample loops which i checked but I'm not sure if this is the right way to perform interchange transform. I suppose i will have to update the dom tree etc? Could someone please guide me if this is the right way to go forward or we need to transform in some other way?

I checked the transform with the following and some other examples it seems to interchange the loops when legal and profitable.I checked the o/p with and without the transform. They seem to be the same in the cases which i have checked.

#include <iostream>
using namespace std;
int N,M;
int A[100][100],B[100][100];
int k;
int main() {

cin >> N >> M;
for(int i=0;i<N;i++)

for(int j=0;j<M;j++)
 cin >> A[i][j];

for(int j=0;j<N;j++) {

for(int i=0;i<M;i++) {
      if(i%2)
        A[i+2][j+2] = A[i][j]+k;
      else
       A[i][j+1] = A[i][j]+k;
  }
}

for(int i=0;i<N;i++)

 for(int j=0;j<M;j++)
  cout << A[i][j] <<"\n";

return 0;

}

Running loop-interchange-

opt -basicaa -loop-interchange test.ll -o test.bc

Awaiting comments and inputs.

Thanks and Regards
Karthik Bhat

Diff Detail

Event Timeline

karthikthecool retitled this revision from to [RFC] Loop Interchane Pass.
karthikthecool updated this object.
karthikthecool edited the test plan for this revision. (Show Details)
jmolloy added a subscriber: jmolloy.Feb 5 2015, 8:17 AM

Hi Karthik,

Thanks for working on this! Loop interchange is something I think LLVM really needs. I have some high level comments on your approach.

  • You mention you only handle 2-deep nested loops. But actually, you can handle any level of nesting by doing pairwise swaps working from innermost nest out.
  • Your function to determine whether to swap or not is too simplistic. It will only catch a very small number of cases (GEPs from the IV). The obvious way to do this is to grab the SCEV for each load and store, and if it is an AddRecExpr, check the step value.
  • Your described algorithm for performing the swap seems fine, but the implementation is awfully complicated. Once you have split the latch, I'd have thought you could just split the PHIs from the top of each header block. This would give you a bunch of blocks with branches between them - just rearrange the branches (and perform trivial PHI updates).

Cheers,

James

lib/Transforms/Scalar/LoopInterchange.cpp
64

Loop *Inner = L.getSubLoops().back(); ?

Great to see this being worked on! It will help the implicit pocl work-group vectorization after the 2-level loop restricting is lifted. Related to this, be careful with the loop id metadata and especially the parallel loop metadata. The parallel loop metadata can be exploited to skip dependency analysis in some cases (At least when both of the loops are parallel? Needs to be thought through.). But if the metadata is moved to a wrong loop during the interchange it might result in a miscompilation.

include/llvm/Transforms/Scalar.h
136

Another common motivation for loop interchange is the improved utilization of SIMD instructions which your pass incidentally does too?

lib/Transforms/Scalar/LoopInterchange.cpp
50

Seems like a useful utility function, maybe could be moved to Loop?

185

If this restriction can be removed, this pass would be really useful for OpenCL C implicit work-group vectorization in pocl.

191

Typo 'Worlist'

239

Better use an explicit NULL comparison? Not sure what the LLVM convention is here.

267

Typo dependece

271

Typo dependece

273

This part needs comments (what is being checked?).

298

Comment needed (purpose of the function?).

397

Doesn't this require the loops to be perfectly nested? Is it being (implicitly) checked now?

673

Shouldn't this update the loop metadata?

Hi James,Pekka,
Thanks for the comments and spending your valuable time on this. Please find my comments.

[James] You mention you only handle 2-deep nested loops. But actually, you can handle any level of nesting by doing pairwise swaps working from  innermost nest out.

Yes i agree we can extend this to handle any level of interchange. But since i 'm writing a loop opt for the first time i was thinking to implement a level 2 interchange first and then build upon it. Will that be ok?

[James] Your function to determine whether to swap or not is too simplistic. It will only catch a very small number of cases (GEPs from the IV). The obvious way to do this is to grab the SCEV for each load and store, and if it is an AddRecExpr, check the step value.

I will look into this in more detail. It would be great if you could point out an example were the current check would fail that could have been caught using SCEV. One thing i though about was a scalar reduction variable in a loop but can't we catch it using DependencyAnalysis pass as well? I'm a bit new to Loop optmitization it would be great if you could clarify this a bit more.

[James]  Once you have split the latch, I'd have thought you could just split the PHIs from the top of each header block. This would give you a bunch of blocks with branches between them - just rearrange the branches (and perform trivial PHI updates)

I have tried to do the same in the current implementation in adjustLoopLinks after splitting innerlooplatch and inner loop header. I will try to simplify the code a bit more in the upcoming patch update.

Hi Pekka,
Thanks for the review. Please find my comments inline.

Is it ok if i move the updated patch to llvm-commits mailing list for further review?
Thank you once again for your valuable comments i will get back with an updated patch shortly.
Regards
Karthik Bhat

include/llvm/Transforms/Scalar.h
136

Yes i agree. One thing i can think of is adjusting the LoopInterchangeProfitability to keep loops of stride 1 as the innermost loop when possible which could help loop vectorizer etc.

lib/Transforms/Scalar/LoopInterchange.cpp
50

Sure we can move this utility to Loop. Will modify in my next patch update.

185

Yes i agree. The final goal is to make this pass generic for any level. But since i'm a bit new to Loop optmizations I was thinking to do this in steps first by handling loops of depth 2 and then building on it to support for any levels.

191

Will update this and share the patch shortly.

273

If we have detected an anti dependency between 2 memory instructions here we try to get the dependency distance or direction of the dependence for each loop nesting level.
We can interchange any 2 levels of the loop nest only if we do not have positive distance or direction in those levels. This will ensure that rearranging is legal. I reffered https://engineering.purdue.edu/~milind/ece573/2011spring/lecture-14.pdf for legality of loop interchange transform.

298

This is current limitations in our transform because of which we are not interchanging a vaild nested loop.This will be finally removed when we completly fix the transform part.

397

Yes we need to check for perfectly nested loops in legality. I have added the code for it and will share the updated code shortly.

673

Currently we only update the successor node of the branch instruction of the loop latch. So the metadata portion should be intact.
I'm not sure if we have to update the metadata portion after moving outer loop as inner loop. It would be great if you could help me out with an example were we might have to update the metadata due to reordering.

lib/Transforms/Scalar/LoopInterchange.cpp
273

OK. Worth adding this as a comment to the code as this is not obvious to a reader new to the loop interchange optimization.

673

OK. I didn't read the transformation part in detail so I cannot immediately tell if it breaks or not. If the loop id still points to the correct original loop, it should be OK.

suyog added a subscriber: suyog.Feb 6 2015, 10:31 AM
hfinkel added a subscriber: Unknown Object (MLST).Feb 6 2015, 2:03 PM
hfinkel added a subscriber: hfinkel.Feb 6 2015, 2:10 PM

After this step the loop is interchanged. It gives the correct results for few of the sample loops which i checked but I'm not sure if this is the right way to perform interchange transform. I suppose i will have to update the dom tree etc?

The method sounds reasonable (as James pointed out, I think there might be a simpler implementation). You only have to update the domtree if you claim to preserve it (same with LoopInfo, for example).

Is it ok if i move the updated patch to llvm-commits mailing list for further review?

Yes, please do. You should have done this from the very beginning (and subscribed llvm-commits to this review, unfortunately, none of the (fairly extensive) comments here were mirrored to the list).

Also, there are a couple of test cases in the TSVC benchmark (which is in our test suite), where loop interchange would enable vectorization. It would be nice to know that we get those cases with this pass.

Hi Karthik,

I will look into this in more detail. It would be great if you could point out an example were the current check would fail that could have been caught using SCEV. One thing i though about was a scalar reduction variable in a loop but can't we catch it using DependencyAnalysis pass as well? I'm a bit new to Loop optmitization it would be great if you could clarify this a bit more.

Sure. Consider this:

int a = ...;
for (i : ...) {
  for (j : ...) {
    x[j][i+a];
  }
}

This will produce an add, then a GEP. Your code will fail to match it as it won't see through the add. Also, the add is of an unknown quantity, but SCEV knows that this is loop invariant, so SCEV can help you here.

Hi James,Hal,
Thanks for your inputs. Hal I will move this to llvm commits shortly was working on few issues which i found during testing.
James i tried the example-

 int a = 2;
 for (i=0;i<N;++i) {
  for (j=0;j<M;++j) {
    A[j][i+a] = A[j][i]+1;
  }
}

The dependency analysis returns an anti dependecy of [-2 0] between the load and store and we interchange successfully.
But if add is unknown then i assume it is not safe to interchange as we do not know for sure what dependecy exits.

Dependency Analysis seems to be internally using ScalarEvolution.
If there is an loop independent variable it is captured using |< symbol.
e.g.
In the matrix multiplication code-

for(int i=0;i<N;i++)
  for(int j=0;j<M;j++)
    for(int k=0;k<K;k++)
      A[i][j]=A[i][j]+B[i][j]*C[j][k];

we get a dependecy vector of -

consistent anti [0 0|<]

i.e. the load is independent of inner most iterator (k). So in this case we can move it to outermost as other 2 dependency are 0 (or =) and the code will be vectorize.

The reason I was using this module was because it was giving me the required direction/distance vector/matrix which i can use to decide legality/profitability of interchange.(Based on "Optimizing Compilers for Modern Architectures: A Dependence-Based Approach" by Ken Kennedy)
I think we can use ScalarEvolution to construct the direction/distance matrix as well based on your suggestion but i thought Dependency Analysis already does that for us. I will recheck.

Thanks
Karthik Bhat

karthikthecool abandoned this revision.Mar 8 2015, 9:52 PM

Patch was moved to D7499 and committed as r231458.