If inner loop is loop independent or doesn't carry any dependency then, loop interchange is not profitable. Also if outer loop is not loop independent then, loop interchange is not profitable. if inner loop has dependence and outer loop is loop independent then, it is profitable to interchange to enable inner loop parallelism. Corrected the dependency checking inside isProfitableForVectorization(). Also addresses the endless interchange problem. If Cache analysis could decide the loop order then, isProfitable will not invoke isProfitableForVectorization for decidinding better loop interchange.
Details
- Reviewers
- fhahn - congzhe - Meinersbur - bmahjour 
- Group Reviewers
- Restricted Project 
- Commits
- rGee7188c8b2ab: [LoopInterchange] Correcting the profitability check
Diff Detail
Event Timeline
Before this patch, If outer loop dependency direction "=" and Inner loop dependency direction is "S" and "I" then, loop interchange is considered as profitable. Only two cases of dependency is profitable. But for vectorization, ">" and "<" dependency in outer loop is more profitable when interchanged. After patch [=,<] and [=,>] will be interchanged for vectorization.
The proper profitability analysis for loop interchange uses CacheCost, although there are cases where it may be unable to analyze certain accesses (eg due to delinearization issues). It would be good to understand why we don't go through the CacheCost path in your use case.
The legacy heuristic doesn't look right to me, although your changes make the logic more aligned with the comments in the code.
| llvm/test/Transforms/LoopInterchange/pr43797-lcssa-for-multiple-outer-loop-blocks.ll | ||
|---|---|---|
| 8 | I'm worried about losing functional coverage by avoiding interchange here due to profitability. Can you play with -loop-interchange-threshold or make slight changes to the memory accesses to make the test case profitable? | |
I thought what you meant is that after this patch, [<, =] and [>, =] (not [=,<] and [=,>]) will be interchanged? Because after interchange the dependency vector would become [=, <] and [=, >] respectively, which could improve potential parallelization and enable finer-grained parallelism, i.e., outer loop parallelism instead of inner loop parallelism. I think this is what isProfitableForVectorization() is supposed to be.
I wonder if it makes sense to you @bmahjour ?
I think profitability determination solely based on dependency matrix is fundamentally flawed. One obvious example is this
void f1() {
  for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++) {
      B[j][i] = A[j][i];
    }  
}Assuming A and B don't alias, there are no dependencies, nevertheless interchange is profitable.
Another example is:
void f2() {
  // > =
  for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++) {
      ... = A[i][j];
      A[i-1][j] = ...;
    }
}The dependence is carried by the outer loop, yet it's not profitable to interchange (since it would make both locality and parallelism worse).
The following example is profitable to interchange, but won't be recognized as profitable after this patch:
void f3() {
  // = =
  for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++) {
      ... = A[j][i];
      A[j][i] = ...;
    }
}I'm worried about losing functional coverage by avoiding interchange here due to profitability. Can you play with -loop-interchange-threshold or make slight changes to the memory accesses to make the test case profitable?
Reverted the changes in pr43797-lcssa-for-multiple-outer-loop-blocks.ll. In this case, CostMap is empty and calculated legacy cost is 0. For retaining the loop interchange functionality, lowered the loop interchange threshold to -1000 (added -loop-interchange-threshold=-1000).
In loop interchange, profitablility is determined first by the cost model and then by isProfitableForVectorization(). Loops in f1() and f3() will be interchanged by CacheCost checking. In profitability checking of loop interchange, CacheCost analysis is done first. So this patch will not affect the f1() and f3() cases.
For f2() case, CacheCost and Legacy cost model will not show profitability. If loops in f2() is loop interchanged then, Outer loop parallelization is possible, which could improve parallelism.
Could you add a test case that is not considered profitable where before it was not?
I understand the idea is that when the locality model says that interchanged and non-interchanged have the same profitability according to cache locality analysis, only then we fall back to a heuristic that determines whether maybe the innermost loop of the interchange loop nest can be vectorized when the one before vectorization could not.
Unfortunately LoopInterchangeProfitability::isProfitable is not structured like that. It falls back to isProfitableForVectorization whenever cache analysis does not want to interchange the loops. This includes the case where the old loop nest has better locality than the interchanged would have. This could result in a endless application of LoopInterchange if we would run it until no more change is to be made:
- Cache analysis thinks non-interchanged nest has more locality => fall back to isProfitableForVectorization which returns true => do the interchange
- Cache analysis thinks the interchange has more locality => do the interchange (with two interchanges we arrive at the original loop)
- Cache analysis thinks non-interchanged nest has more locality => fall back to isProfitableForVectorization which returns true => do the interchange again
- ...
The current LoopVectorize pass only supports innermost loops, hence you would want the dependencies carried by the outer loops so the LoopVectorize pass does not have to consider them.
Btw, why does LopInterchange use a DepMatrix when the dependencies for all instructions in the loop could be just summarized in a single vector?
| llvm/lib/Transforms/Scalar/LoopInterchange.cpp | ||
|---|---|---|
| 1116–1118 | Does you patch cover this TODO? | |
| 1120–1123 | Shouldn't the condition on Row[InnerLoopId] and Row[OuterLoopId] be exact opposite? That is, it is profitable if the innermost loop has loop-carried dependencies while the outer has not? | |
| 1162–1163 | Should this only be considered if InnerLoopId is actually an innermost loop (The only kind LoopVectorize can currently process)? | |
| llvm/lib/Transforms/Scalar/LoopInterchange.cpp | ||
|---|---|---|
| 1116–1118 | Corrected the dependency checking and corrected the comments. | |
| 1120–1123 | Corrected the dependency check. If inner loop has loop carried dependency and outer loop is loop independent then, loop interchange is considered as profitable for vectorization. | |
| llvm/test/Transforms/LoopInterchange/pr43797-lcssa-for-multiple-outer-loop-blocks.ll | ||
|---|---|---|
| 8 | These changes are not needed after correcting the dependency check in loop interchange. | |
Following changes are made as per the comments. If inner loop is loop independent or doesn't carry any dependency then, loop interchange is not considered as profitable. Also if outer loop is not loop independent then, loop interchange is not considered as profitable. If inner loop has dependence and outer loop is loop independent then, it is profitable to interchange to enable inner loop parallelism.
@Meinersbur and @bmahjour, Could you get the chance to review the code I modified as per your comment?
This looks better now, but the problem of "endless interchange" is still not addressed. Per discussion in the loop opt call, we should only use CacheCost when it was able to calculate a valid cost and the loops can be sorted based on "strictly less than" ordering relationship. Only when CacheCost result is indeterminate (eg. two loops have equal cost) or when it is unable to calculate the cost (eg. due to delinearization issues), we should fall back to the isProfitableForVectorization() function.
| llvm/lib/Transforms/Scalar/LoopInterchange.cpp | ||
|---|---|---|
| 1127 | not not -> not | |
This looks better now, but the problem of "endless interchange" is still not addressed. Per discussion in the loop opt call, we should only use CacheCost when it was able to calculate a valid cost and the loops can be sorted based on "strictly less than" ordering relationship. Only when CacheCost result is indeterminate (eg. two loops have equal cost) or when it is unable to calculate the cost (eg. due to delinearization issues), we should fall back to the isProfitableForVectorization() function.
Corrected the "endless interchange" possibility. Now isProfitableForVectorization() only invokes when two loops have equal cost or it is unable to calculate the cost.
| llvm/lib/Transforms/Scalar/LoopInterchange.cpp | ||
|---|---|---|
| 1127 | corrected | |
| llvm/lib/Transforms/Scalar/LoopInterchange.cpp | ||
|---|---|---|
| 1167 | ||
| llvm/lib/Transforms/Scalar/LoopInterchange.cpp | ||
|---|---|---|
| 1167 | This is exactly the else branch of the cache analysis logic. It does not consider the fallback Cost < -LoopInterchangeCostThreshold legacy cost model. Please avoid the code duplication and needing to loop up the CostMap again. | |
| 1169–1170 | Consider updating this message. Suggestion: "Interchanging loops not considered to improve cache locality nor vectorization." | |
| llvm/lib/Transforms/Scalar/LoopInterchange.cpp | ||
|---|---|---|
| 1156–1157 | The structure suggested in the LoopWG call. The more general pattern/refactoring would be: std::optional<bool> shouldInterchange = isProfitableAccordingLoopCacheAnalysis(..);
if (shouldInterchange.has_value()) 
  return shouldInterchange.get_value();
shouldInterchange = isProfitableAccordingInstrOrderCost(..);
if (shouldInterchange.has_value()) 
  return shouldInterchange.get_value();
shouldInterchange = isProfitableForVectorization(..);
if (shouldInterchange.has_value()) 
  return shouldInterchange.get_value();
emitOptimizationRemark("Don't know");However, this changes when the emitOptimizationRemark is called. If we would not want to change this, it would be (which corresponds to the current structure but with refactoring): std::optional<bool> shouldInterchange = isProfitableAccordingLoopCacheAnalysis(..);
if (!shouldInterchange.has_value()) {
  shouldInterchange = isProfitableAccordingInstrOrderCost(..);
  if (!shouldInterchange.has_value()) {
     shouldInterchange = isProfitableForVectorization(..);
  }
}
if (!shouldInterchange.has_value())
  emitOptimizationRemark("Don't know");
else if (!shouldInterchange.get_value())
  emitOptimizationRemark("Profitability heuristic indicates this loop is good as-is");although I would prefer the former over the nested if-else chain and instead emit different optimization remarks for each of the heuristic that indicates that/why the loops should (NOT) be interchanged. | |
As per the Michael 's comments. Code inside isProfitable function is cleaned.
Added test case. With two loop interchange passes make sure that, there is no endless interchange. Before this patch there is endless loop interchange possibility. There is no endless interchange after this patch.
I would suggest to revise the summary and/or the title to describe that this patch now also addresses the endless interchange problem, in addition to correction in isProfitableForVectorization().
| llvm/test/Transforms/LoopInterchange/profitability.ll | ||
|---|---|---|
| 238 ↗ | (On Diff #479293) | I would suggest to apply --check-prefix=PROFIT only to interchange_05() and not to other existing tests. Because this opt command line runs interchange twice and shows that there would have been endless interchange with interchange_05() before this patch but there is no endless interchange after this patch, which applies only to interchange_05() and is not related to other tests. | 
| llvm/lib/Transforms/Scalar/LoopInterchange.cpp | ||
|---|---|---|
| 1170 | Is it possible to make LoopInterchangeCostMeaningfulnessThreshold an opt flag so we can assign its value more flexibly? Possibly rename it to LegacyCostModelThreshold or whatever is more appropriate. static cl::opt<int> LegacyCostModelThreshold(
    "legacy-cost-threshold", cl::init(10), cl::Hidden,
    cl::desc("The threshold for the legacy cost model to be considered.")); | |
Added comments in Test case 05 for better understanding. 
Corrected the comment in Test case 01. The loop interchange decision is made from the Cache cost analysis and is not from vectorization.
| llvm/lib/Transforms/Scalar/LoopInterchange.cpp | ||
|---|---|---|
| 1137 | nit: isProfitableAccordingToLoopCacheAnalysis, or isProfitablePerLoopCacheAnalysis | |
| 1163 | nit: isProfitableAccordingToInstrOrderCost, or isProfitablePerInstrOrderCost | |
| 1201 | It could be worth adding some comments for this function that describe what it does now, and how it prevents endless interchange from happening. | |
| llvm/test/Transforms/LoopInterchange/profitability.ll | ||
| 175 ↗ | (On Diff #482045) | nit: may leads->may lead | 
| llvm/lib/Transforms/Scalar/LoopInterchange.cpp | ||
|---|---|---|
| 64 | I don't see why we need two options to control the legacy cost model. | |
| 1158 | We should return nullopt when/if InnerIndex == OuterIndex | |
| 1179 | since the idea is to call this function only when CC has failed or been indecisive, we don't need to pass CC to this function and check it here. | |
| llvm/test/Transforms/LoopInterchange/profitability.ll | ||
| 171–176 ↗ | (On Diff #482681) | reword: This tests to make sure, that multiple invocations of loop interchange will not undo previous interchange and will converge to a particular order determined by the profitability analysis. | 
| llvm/lib/Transforms/Scalar/LoopInterchange.cpp | ||
|---|---|---|
| 64 | This new parameter added as per @Meinersbur comments in Inline. This is considered as the upper limit of InstrOrderCost. | |
| 1158 | The CostMap is assigned with unique number to each loop. So this condition "InnerIndex == OuterIndex" will not satisfy. But LoopCost may be same, CC->getLoopCost(*InnerLoop) == CC->getLoopCost(*OuterLoop) checking can be added for returning nullopt. | |
perserve-lcssa.ll  corrected, Instruction order cost is 0, Not profitable to interchange.
loop-interchange-optimization-remarks.ll remark message corrected.
All comments are updated.
Hi @bmahjour, Changes are made as per your comments. Could you get the time to review it.
Hi @Meinersbur, Some changes are done over your suggested comments (https://reviews.llvm.org/D135808?id=474010#inline-1332682)
LGTM.
Sorry I didn't find time to review it.
| llvm/test/Transforms/LoopInterchange/perserve-lcssa.ll | ||
|---|---|---|
| 172 ↗ | (On Diff #485706) | The test fails for me with the latest version because of opaque pointers emitted by opt. Regeneration of the test should fix this. | 
I don't see why we need two options to control the legacy cost model.