Page MenuHomePhabricator

[LoopInterchange] Try to achieve the most optimal access pattern after interchange
ClosedPublic

Authored by congzhe on Feb 22 2022, 10:04 PM.

Details

Summary

Motivated by pr43326 (https://bugs.llvm.org/show_bug.cgi?id=43326). What is significant there is the loop is a triply-nested loop. And a slightly modified test case is as follows.

void f(int e[10][10][10], int f[10][10][10]) {
  for (int a = 0; a < 10; a++) {
    for (int b = 0; b < 10; b++) {
      for (int c = 0; c < 10; c++) {
        f[c][b][a] = e[c][b][a];
      }
    }
  }
}

The ideal optimal access pattern after running interchange is supposed to be the following

void f(int e[10][10][10], int f[10][10][10]) {
  for (int c = 0;  c < 10; c++) {
    for (int b = 0; b < 10; b++) {
      for (int a = 0; a < 10; a++) {
        f[c][b][a] = e[c][b][a];
      }
    }
  }
}

Currently we interchange two times, changing the loop order from a->b->c to a->c->b and c->a->b. In other words, we are limited to picking up the innermost loop and finding an order that is locally optimal for it. However, the pass failed to produce the optimal loop access order which is c->b->a. For more complex examples what we get would be somehow quite far from the globally optimal ordering.

What is proposed in this pattern is to do a "bubble-sort" fashion when doing interchange. By comparing neighbors in LoopList in each iteration, we would be able to move each loop onto a most appropriate place, hence this is an approach that tries to achieve the globally optimal ordering.

To be specific, for the motivating example, the original order is

a->b->c

In the first iteration, we first interchange b and c, which gives us

a->c->b,

then we interchange a and c, which gives us

c->a->b.

Then in the second iteration, we interchange a and b, which gives us

c->b->a,

which is the ideal ordering.

This approach is supposed to work for however complex cases as well. In fact for the real test case in pr43326, the loop has reductions and currently loop interchange is not able to handle multi-level reduction loops (more than 2 levels). I currently limit this patch to deal only with simple multi-level loops, and I would be willing to post follow-up patches that enable interchange with multi-level reductions.

Comments on this patch are appreciated :)

Diff Detail

Event Timeline

congzhe created this revision.Feb 22 2022, 10:04 PM
congzhe requested review of this revision.Feb 22 2022, 10:04 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 22 2022, 10:04 PM
congzhe edited the summary of this revision. (Show Details)Feb 22 2022, 10:06 PM
congzhe edited the summary of this revision. (Show Details)Feb 22 2022, 10:09 PM
congzhe added reviewers: Whitney, bmahjour, Meinersbur, Restricted Project.Feb 22 2022, 10:46 PM
congzhe edited the summary of this revision. (Show Details)Feb 22 2022, 10:47 PM
Meinersbur added inline comments.
llvm/lib/Transforms/Scalar/LoopInterchange.cpp
281

If you don't assume the vector is empty when starting (assert(LoopList.empty())?), consider clearing the list at the start of the function.

Change seems NFC anyway.

417

Why make it an object member?

502–516

Could you add the motivation to use bubble sort to the comment here?

Any particular reason to start with moving the innermost loops inwards (decreasing j) instead the other way around?

507

There should also be an early abort if there was no interchange during an entire round of i (like https://en.wikipedia.org/wiki/Bubble_sort#Pseudocode_implementation : until not swapped)

llvm/test/Transforms/LoopInterchange/phi-ordering.ll
12

Why changing this test?

Herald added a project: Restricted Project. · View Herald TranscriptMar 29 2022, 6:31 PM

Overall looks correct to me. Since loops are usually not that deep, I have no concern with the O(n^2) time of bubble sort.

congzhe updated this revision to Diff 419842.Apr 1 2022, 1:03 PM

Thanks a lot for the comments @Meinersbur! I've updated the patch accordingly.

llvm/lib/Transforms/Scalar/LoopInterchange.cpp
281

Thanks for the comment, I added the assert correspondingly.

417

You are correct that it may not be necessary to make it an object member, I moved LoopList to inside the run() function.

502–516

Thanks, comments updated.

There is no particular reason that we start with moving the innermost loops outwards. Previously loop interchange picks the innermost loop and try our best to move it outwards, so in this patch I started with the innermost loop as well, to make it more "consistent" with the previous repo.

507

Thanks, I added an early abort accordingly.

llvm/test/Transforms/LoopInterchange/phi-ordering.ll
12

The original test ignores loop interchange cost model (with -loop-interchange-threshold=-1000) and does interchange two times (which does not generate good memory access pattern).

If I did not change the test, with the bubble sort fashion it would interchange every neighboring loops due to ignored cost model, and will eventually interchange three times generating a completely different output as compared to the original test case.

If I only set -loop-interchange-threshold=0, it would interchange only once which does give good memory access pattern, but the output would still be quite different from the current test.

Therefore I also changed array @A from 1-dimension to 2-dimension, thus it does interchange twice, and generated an output that is as close to the original one as possible.

congzhe updated this revision to Diff 419846.Apr 1 2022, 1:19 PM
Meinersbur accepted this revision as: Meinersbur.Apr 4 2022, 6:05 PM

LGTM

llvm/lib/Transforms/Scalar/LoopInterchange.cpp
430–431

One less call of a copy ctor.

This revision is now accepted and ready to land.Apr 4 2022, 6:05 PM
congzhe updated this revision to Diff 420967.Apr 6 2022, 12:09 PM

Minor update to address the last comment, I'll land it shortly.

This revision was landed with ongoing or failed builds.Apr 6 2022, 12:32 PM
This revision was automatically updated to reflect the committed changes.