This is an archive of the discontinued LLVM Phabricator instance.

[LoopCacheAnalysis] Consider dimension depth of the subscript reference when calculating cost
ClosedPublic

Authored by bmahjour on Apr 8 2022, 9:18 AM.

Details

Summary

As described in https://reviews.llvm.org/D122776, the current LoopCacheCost analysis is not able to determine a profitable nesting for the outer loops of a nest more than 2 levels deep. For example, consider the first loop in the existing LIT test llvm/test/Analysis/LoopCacheAnalysis/PowerPC/matvecmul.ll. The loop looks like this:

;   for (int k=1;k<nz;++k)
;      for (int j=1;j<ny;++j)
;        for (int i=1;i<nx;++i)
;          for (int l=1;l<nb;++l)
;            for (int m=1;m<nb;++m)
;                 y[k+1][j][i][l] = y[k+1][j][i][l] + b[k][j][i][m][l]*x[k][j][i][m]

and the costs for the k, j and i loops are all calculated to be 30000000000. The problem is that when considering a subject loop as the innermost loop of the nest, if the access pattern is not consecutive, the cost function returns the trip count of that loop as the estimated number of cache lines accessed. If the trip counts are equal (or unknown, in which case we assume a default value of 100), then the costs of the outer loops will be identical. The cost function needs to give more weight to the reference groups that use a function of the loop's IV as a subscript into outer dimensions. This patch does that by multiplying in the trip counts of the loops corresponding to subscripts that come between the subject loop's subscript and the innermost dimension (i.e., for a given reference group, the farther a subscript is from the innermost dimension, the higher the cost of moving the corresponding loop into the innermost position).
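The weighting described above can be sketched as a rough model (hypothetical Python, not the actual LoopCacheAnalysis implementation; the real analysis also divides consecutive accesses by the cache line size and folds in the trip counts of the remaining loops in the nest):

```python
# Hypothetical model of the per-reference-group cost idea in this patch.
# A reference is described by the loop IV used in each subscript,
# outermost dimension first, e.g. y[k][j][i] -> ['k', 'j', 'i'].

def ref_cost(subscript_ivs, trip_counts, candidate):
    """Estimated cache lines touched by one reference group when
    `candidate` is placed in the innermost position."""
    if candidate not in subscript_ivs:
        return 1  # loop-invariant reference: the same line is reused
    dim = subscript_ivs.index(candidate)
    innermost = len(subscript_ivs) - 1
    if dim == innermost:
        # Consecutive access; the real analysis divides this by the
        # cache line size.
        return trip_counts[candidate]
    # Non-consecutive: weight the trip count by the trip counts of the
    # loops whose subscripts lie between this dimension and the
    # innermost one, so outer-dimension strides cost more.
    cost = trip_counts[candidate]
    for d in range(dim + 1, innermost + 1):
        cost *= trip_counts[subscript_ivs[d]]
    return cost

trips = {'k': 100, 'j': 100, 'i': 100}
# For a reference y[k][j][i]: the farther a subscript is from the
# innermost dimension, the costlier it is to move its loop innermost.
costs = {L: ref_cost(['k', 'j', 'i'], trips, L) for L in 'kji'}
assert costs['k'] > costs['j'] > costs['i']
```

With equal trip counts, the outer loops now get distinguishable costs rather than the identical values described above.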

Diff Detail

Event Timeline

bmahjour created this revision. · Apr 8 2022, 9:18 AM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 8 2022, 9:18 AM
bmahjour requested review of this revision. · Apr 8 2022, 9:18 AM
etiotto accepted this revision. · Apr 11 2022, 7:22 AM

LGTM

This revision is now accepted and ready to land. · Apr 11 2022, 7:22 AM
amehsan added a subscriber: amehsan. Edited · Apr 11 2022, 8:12 AM

Hi Bardia

Thanks for your work on this. While I keep an eye on the work on loop interchange, I don't have enough of the details in mind to give a detailed review here. Just two things occurred to me that may be worth mentioning:

  1. Boiling down various parameters to one final "cost" value has a disadvantage. In fact, we discussed this with @congzhe as the first solution, but we discarded it because you essentially lose some information when you boil everything down to one number. The reason Congzhe decided to go with a two-component cost was that the extra detail carries more useful information and helps make more accurate decisions. Now, this was a theoretical concern and I am not sure whether it holds in practice, but that is the background on Congzhe's work.
  2. Wouldn't it be more appropriate to continue the discussion on Congzhe's patch rather than posting a parallel patch to address the same problem? I think that would be more constructive, given that Congzhe has already identified the problem, investigated some solutions, and developed a patch. I suggest we continue the discussion as review comments on his patch. Once there is agreement on the cost function, Congzhe can update the patch (assuming that is needed) to reflect the agreed-upon solution.
  1. Boiling down various parameters to one final "cost" value has a disadvantage. In fact, we discussed this with @congzhe as the first solution, but we discarded it because you essentially lose some information when you boil everything down to one number. The reason Congzhe decided to go with a two-component cost was that the extra detail carries more useful information and helps make more accurate decisions. Now, this was a theoretical concern and I am not sure whether it holds in practice, but that is the background on Congzhe's work.

The cost returned by the analysis must consider the impact of the outer-dimension strides incurred by each reference group; otherwise it will not be accurate. If, for whatever reason, we need extra details in the future, we could provide extra interfaces to the analysis to query those details.

  2. Wouldn't it be more appropriate to continue the discussion on Congzhe's patch rather than posting a parallel patch to address the same problem? I think that would be more constructive, given that Congzhe has already identified the problem, investigated some solutions, and developed a patch. I suggest we continue the discussion as review comments on his patch. Once there is agreement on the cost function, Congzhe can update the patch (assuming that is needed) to reflect the agreed-upon solution.

I explained my reasoning for needing a different approach in D122776. I just figured it would be a lot easier to show what I mean by posting a new patch instead of trying to describe it as a change request on top of D122776. I'm happy to continue further discussion on this wherever seems more appropriate to you or other reviewers.

I explained my reasoning for needing a different approach in D122776. I just figured it would be a lot easier to show what I mean by posting a new patch instead of trying to describe it as a change request on top of D122776. I'm happy to continue further discussion on this wherever seems more appropriate to you or other reviewers.

Thanks Bardia. Since this patch already has an LGTM, perhaps the easiest thing is to merge it. Just to clarify my comment: first, I want to appreciate your input to the discussions and your effort in reviewing the patches, looking into papers, and in general the feedback that you provide. I think it is more encouraging, once a patch is under discussion, to avoid creating parallel patches, in recognition of the effort that has gone into identifying the problem, investigating different solutions, and creating the initial patch. That is not to imply that reviewing a patch is an easy task; in some cases, as with what you have done, it involves reading a paper or some other significant contribution. But as long as the author of the original patch is active and receptive to feedback, I don't see a reason for parallel patches. I am open to alternative opinions from others.

congzhe accepted this revision. Edited · Apr 12 2022, 5:59 PM
  1. Boiling down various parameters to one final "cost" value has a disadvantage. In fact, we discussed this with @congzhe as the first solution, but we discarded it because you essentially lose some information when you boil everything down to one number. The reason Congzhe decided to go with a two-component cost was that the extra detail carries more useful information and helps make more accurate decisions. Now, this was a theoretical concern and I am not sure whether it holds in practice, but that is the background on Congzhe's work.

The cost returned by the analysis must consider the impact of the outer-dimension strides incurred by each reference group; otherwise it will not be accurate. If, for whatever reason, we need extra details in the future, we could provide extra interfaces to the analysis to query those details.

  2. Wouldn't it be more appropriate to continue the discussion on Congzhe's patch rather than posting a parallel patch to address the same problem? I think that would be more constructive, given that Congzhe has already identified the problem, investigated some solutions, and developed a patch. I suggest we continue the discussion as review comments on his patch. Once there is agreement on the cost function, Congzhe can update the patch (assuming that is needed) to reflect the agreed-upon solution.

I explained my reasoning for needing a different approach in D122776. I just figured it would be a lot easier to show what I mean by posting a new patch instead of trying to describe it as a change request on top of D122776. I'm happy to continue further discussion on this wherever seems more appropriate to you or other reviewers.

Thanks Bardia for this work; I think the approach in this patch does resolve the motivating problem described in D122776. Nevertheless, I would also like to clarify that my approach in D122776 does consider the outer-dimension strides incurred by each reference group (I should clarify this in D122776, though). Currently, loop cache analysis sums the costs of all reference groups for each loop to get the final cost (which is what your patch does as well). What I did was take the stride into account as a second component and, for each loop, take the maximum stride over all reference groups as the final stride, which presumably could resolve the motivating problem too.

After you land this patch, I hope I can get the test case in D122776 merged, since that is really the motivating test for this work. I could update the "CHECK: " lines according to the approach proposed in this patch, and update D122776 to a pure NFC patch that includes only that test. I look forward to your thoughts about it :)

What I did was take the stride into account as a second component and, for each loop, take the maximum stride over all reference groups as the final stride, which presumably could resolve the motivating problem too.

Treating stride as a secondary component is what I respectfully objected to and explained earlier. I'm not sure taking the maximum stride would give us what we need. For example, consider:

for (i)
  for (j)
    for (k)
       ... A[i][j][k]
       ... B[i][k][j]
       ... C[i][k][j]

the maximum stride will be the same for both the i-j-k and i-k-j configurations (despite the second one being more profitable), bringing us back to the original problem.

After you land this patch, I hope I can get the test case in D122776 merged, since that is really the motivating test for this work. I could update the "CHECK: " lines according to the approach proposed in this patch, and update D122776 to a pure NFC patch that includes only that test. I look forward to your thoughts about it :)

Doesn't llvm/test/Analysis/LoopCacheAnalysis/PowerPC/single-store.ll provide the same test coverage? Note that the analysis is not sensitive to the order of the loops within the loop nest, as it considers all permutations regardless of the original order.

What I did was take the stride into account as a second component and, for each loop, take the maximum stride over all reference groups as the final stride, which presumably could resolve the motivating problem too.

Treating stride as a secondary component is what I respectfully objected to and explained earlier. I'm not sure taking the maximum stride would give us what we need. For example, consider:

for (i)
  for (j)
    for (k)
       ... A[i][j][k]
       ... B[i][k][j]
       ... C[i][k][j]

the maximum stride will be the same for both the i-j-k and i-k-j configurations (despite the second one being more profitable), bringing us back to the original problem.

IMHO, for this case the cost of loop-k would be higher than that of loop-j (remember that we compare the cost first and then the stride), so loop cache analysis does output the i-k-j pattern.

After you land this patch, I hope I can get the test case in D122776 merged, since that is really the motivating test for this work. I could update the "CHECK: " lines according to the approach proposed in this patch, and update D122776 to a pure NFC patch that includes only that test. I look forward to your thoughts about it :)

Doesn't llvm/test/Analysis/LoopCacheAnalysis/PowerPC/single-store.ll provide the same test coverage? Note that the analysis is not sensitive to the order of the loops within the loop nest, as it considers all permutations regardless of the original order.

The test case in D122776 is the one that really shows the impact of our work, which is why I developed that test. Without our work, loop cache analysis would fail that test -- it would output the loop vector as [j, i, k], which is not the optimal access pattern.

The current llvm/test/Analysis/LoopCacheAnalysis/PowerPC/single-store.ll (and the tests updated in this patch) do not expose the problem we are working on. The current loop cache analysis already outputs the optimal access pattern for those tests, which makes it less clear why we want to improve loop cache analysis.

What I did was take the stride into account as a second component and, for each loop, take the maximum stride over all reference groups as the final stride, which presumably could resolve the motivating problem too.

Treating stride as a secondary component is what I respectfully objected to and explained earlier. I'm not sure taking the maximum stride would give us what we need. For example, consider:

for (i)
  for (j)
    for (k)
       ... A[i][j][k]
       ... B[i][k][j]
       ... C[i][k][j]

the maximum stride will be the same for both the i-j-k and i-k-j configurations (despite the second one being more profitable), bringing us back to the original problem.

IMHO, for this case the cost of loop-k would be higher than that of loop-j (remember that we compare the cost first and then the stride), so loop cache analysis does output the i-k-j pattern.

Sorry, I made a mistake in the example above. I meant to consider this example:

for (i)
  for (j)
    for (k)
       ... A[i][j][k]
       ... B[j][i][k]
       ... C[j][i][k]

Here the optimal order is j-i-k, but if we take the maximum among all reference groups, we'll end up with the same value for both the i-j-k and j-i-k configurations. With this patch, the j-loop will have a cost larger than the i-loop, and we get the optimal permutation:

Loop 'j' has cost = 201000000
Loop 'i' has cost = 102000000
Loop 'k' has cost = 90000
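The contrast can be made concrete with a small hypothetical model (all helper names, the element size, and the row-major layout are illustrative assumptions, not the analysis's actual code): a max-stride tie-break cannot separate i from j for this example, while a depth-weighted summed cost can.

```python
# Illustrative model only. References: A[i][j][k], B[j][i][k], C[j][i][k],
# assumed row-major N x N x N arrays, all trip counts equal to N.
N = 100
ELEM = 8  # assumed element size in bytes
refs = [list("ijk"), list("jik"), list("jik")]

def stride(subscripts, loop):
    """Byte stride induced by `loop` for one reference."""
    dim = subscripts.index(loop)
    return ELEM * N ** (len(subscripts) - 1 - dim)

# Max stride over all reference groups ties for i and j: each of them
# indexes the outermost dimension of some reference.
assert max(stride(r, 'i') for r in refs) == max(stride(r, 'j') for r in refs)

def weighted_cost(subscripts, trips, loop):
    """Trip count weighted by the trip counts of the dimensions between
    this subscript and the innermost one (the idea of this patch)."""
    dim = subscripts.index(loop)
    cost = trips[loop]
    for d in range(dim + 1, len(subscripts)):
        cost *= trips[subscripts[d]]
    return cost

trips = {'i': N, 'j': N, 'k': N}
total = lambda L: sum(weighted_cost(r, trips, L) for r in refs)
# Summing the weighted costs separates the outer loops: j is costlier
# than i, so the optimal permutation j-i-k is recovered.
assert total('j') > total('i') > total('k')
```

In this toy model total('j') is 2010000 and total('i') is 1020000; scaled by the remaining loop's trip count of 100, these line up with the 201000000 and 102000000 figures quoted above.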

After you land this patch, I hope I can get the test case in D122776 merged, since that is really the motivating test for this work. I could update the "CHECK: " lines according to the approach proposed in this patch, and update D122776 to a pure NFC patch that includes only that test. I look forward to your thoughts about it :)

Doesn't llvm/test/Analysis/LoopCacheAnalysis/PowerPC/single-store.ll provide the same test coverage? Note that the analysis is not sensitive to the order of the loops within the loop nest, as it considers all permutations regardless of the original order.

The test case in D122776 is the one that really shows the impact of our work, which is why I developed that test. Without our work, loop cache analysis would fail that test -- it would output the loop vector as [j, i, k], which is not the optimal access pattern.

The current llvm/test/Analysis/LoopCacheAnalysis/PowerPC/single-store.ll (and the tests updated in this patch) do not expose the problem we are working on. The current loop cache analysis already outputs the optimal access pattern for those tests, which makes it less clear why we want to improve loop cache analysis.

The current loop cache analysis outputs the loops in the correct order by luck (because it maintains the original breadth-first order, and that order just happens to be the optimal one), but it outputs the same cost value for the two outer loops, which is the root problem! By ensuring that a correct and distinguishable cost is associated with each loop, we also ensure that the optimal order is maintained. I do see your point in wanting to make sure the sort order is correct, but in that case we probably want to use CHECK-NEXT instead of CHECK-DAG for your test.

The current loop cache analysis outputs the loops in the correct order by luck (because it maintains the original breadth-first order, and that order just happens to be the optimal one), but it outputs the same cost value for the two outer loops, which is the root problem! By ensuring that a correct and distinguishable cost is associated with each loop, we also ensure that the optimal order is maintained. I do see your point in wanting to make sure the sort order is correct, but in that case we probably want to use CHECK-NEXT instead of CHECK-DAG for your test.

Sure, I will update the test case in D122776 to use CHECK-NEXT instead of CHECK-DAG, and update the cost numbers therein according to your improvement in this patch. Thanks again for all the discussions!

bmahjour updated this revision to Diff 423938. Edited · Apr 20 2022, 9:39 AM

Added a test for the multi-store case I mentioned above.

Hi Bardia @bmahjour, my apologies that I don't clearly remember the discussion during the loopopt meeting -- did you want me to commit this patch for you, or did you want me to keep an eye on it after you commit it?

bmahjour updated this revision to Diff 426460. · May 2 2022, 11:55 AM

Sorry for the delay; I didn't get a chance to commit this before I went on vacation.

I just noticed that after the fixed-size delinearization commit went in, this patch needed to be rebased and new values needed to be generated for the newly added LoopnestFixedSize.ll. I also noticed that this test exposes an issue with the analysis producing large cost values that can overflow the underlying int64_t type. This problem is not specific to this patch, as described in https://github.com/llvm/llvm-project/issues/55233. I think the overflow problem should be dealt with as a separate issue, and I will add it to the agenda for the next LoopOptWG call. In the meantime I've changed LoopnestFixedSize.ll to use smaller trip counts. @congzhe please let me know if you are ok with this change. We should recreate a test with large sizes as part of fixing 55233.
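The overflow concern is easy to reproduce with back-of-the-envelope numbers (the trip count and nest depth below are hypothetical, chosen only to illustrate the magnitude): an outer loop's cost multiplies several trip counts together, so the product can exceed the int64_t range used to accumulate costs.

```python
# Illustrative only: a product of a few large trip counts overflows the
# signed 64-bit range that the analysis uses for cost values.
INT64_MAX = 2**63 - 1

trip_count = 100_000   # hypothetical per-loop trip count
nest_depth = 4         # hypothetical nest depth
cost = trip_count ** nest_depth  # 10**20, fine in Python's big ints

assert cost > INT64_MAX  # would wrap around in int64_t arithmetic
```

Python integers are arbitrary-precision, so the sketch only demonstrates the magnitude; in C++ the same product would silently wrap.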

Sorry for the delay; I didn't get a chance to commit this before I went on vacation.

I just noticed that after the fixed-size delinearization commit went in, this patch needed to be rebased and new values needed to be generated for the newly added LoopnestFixedSize.ll. I also noticed that this test exposes an issue with the analysis producing large cost values that can overflow the underlying int64_t type. This problem is not specific to this patch, as described in https://github.com/llvm/llvm-project/issues/55233. I think the overflow problem should be dealt with as a separate issue, and I will add it to the agenda for the next LoopOptWG call. In the meantime I've changed LoopnestFixedSize.ll to use smaller trip counts. @congzhe please let me know if you are ok with this change. We should recreate a test with large sizes as part of fixing 55233.

Thanks, I agree that we can merge this patch and then resolve the overflow issue.

This revision was landed with ongoing or failed builds. · May 2 2022, 1:51 PM
This revision was automatically updated to reflect the committed changes.