This is an archive of the discontinued LLVM Phabricator instance.

[LoopInterchange] New cost model for loop interchange
ClosedPublic

Authored by congzhe on May 4 2022, 7:48 AM.

Details

Summary

This patch proposed to replace the current loop interchange cost model by a new one, i.e., one that is returned from loop cache analysis.

Motivation behind

Given a loopnest, what loop cache analysis returns is a vector of loops [loop0, loop1, loop2, ...] where loop0 should be replaced as the outermost loop, loop1 should be placed one more level inside, and loop2 one more level inside, etc. What loop cache analysis does is not only more comprehensive than the current cost model, it is also a "one-shot" query meaning that we only need to query once during the entire loop interchange pass, which is better than the current cost model where we query it every time we check whether it is profitable to interchange two loops. Thus there is complexity saving, especially after D120386 where we do more interchanges to get the globally optimal loop access pattern.

Changes made to test cases

There's some changes to lit tests. Most of them are minor. One change that applies to all tests is that I changed the target triple from "x86_64" to "aarch64" and added "-mcpu=tsv110" in the RUN lines. This is because loop cache analysis needs the cache line size, which is given by "TTI.getCacheLineSize()". However, for x86 subtargets the "getCacheLineSize()" method is not implemented hence "TTI.getCacheLineSize()" would just return 0. This change I made makes "TTI.getCacheLineSize()" return a valid number and hence loop cache analysis can proceed as normal.

*update:* now the target triple are changed to powerpc as per comments.

interchange-no-deps.ll: removed the test function "no_bad_order()" because it is in fact irrelevant to situations for loop interchange and only aims at testing the legacy cost model (operands in gep instructions, to be more specific). The memory access is irrelevant to the outer loop and the outer loop should have been deleted, hence the function is not applicable to loop interchange. Therefore this is not a typical IR that we would encounter in real situations. The new and legacy cost models give different results for this function, so I just removed this test function since it is in fact not applicable to loop interchange.

interchanged-loop-nest-3.ll, not-interchanged-loop-nest-3.ll, not-interchanged-tightly-nested.ll: the IR was not really entirely correct, since the target triple specified was "x86_64" but the getelementptr indices are 32 bits. The indices should be 64 bits since pointer arithmetric should be 64 bit. So I changed them from i32 to i64, otherwise it will trigger SCEV assertion failures when running loop cache analysis, which says "scev operand types mismatch".

A note: currently we did not completely remove the legacy cost model, but keep it under an opt flag. This is because currently if we only used the new cost model, some lit tests would fail. The reason is that we leverage delinearization in loop cache analysis , and delinearization needs some enhancement -- currently it cannot successfully delinearize some tests and loop cache analysis would just bail out. I'll put enhancement of delinearization into my next-steps.

Diff Detail

Event Timeline

congzhe created this revision.May 4 2022, 7:48 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 4 2022, 7:48 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
congzhe requested review of this revision.May 4 2022, 7:48 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 4 2022, 7:48 AM
congzhe updated this revision to Diff 429392.May 13 2022, 5:04 PM
congzhe retitled this revision from [WIP][LoopInterchange] New cost model for loop interchange to [LoopInterchange] New cost model for loop interchange.May 13 2022, 5:42 PM
congzhe edited the summary of this revision. (Show Details)
congzhe edited the summary of this revision. (Show Details)May 13 2022, 5:42 PM
congzhe edited the summary of this revision. (Show Details)May 13 2022, 6:03 PM
congzhe edited the summary of this revision. (Show Details)May 13 2022, 6:13 PM
congzhe added reviewers: bmahjour, Meinersbur, Restricted Project.
congzhe set the repository for this revision to rG LLVM Github Monorepo.
congzhe added a project: Restricted Project.
congzhe edited the summary of this revision. (Show Details)

Other than the inline comments looks good to me.

llvm/lib/Transforms/Scalar/LoopInterchange.cpp
63

Not sure if we need an option, I slightly prefer removing it, because the fact that we have an "EnableLegacy" option for it that defaults to "true" might incorrectly imply that we still use the old cost model where in fact we use the new one for majority of cases.

512

Add a comment about what the index reprsents (ie the optimal order of loops in the nest).

1180

I think we should try to limit the use of legacy cost model only to cases were LoopCacheCost fails due to delinearization, as you mentioned in the description. Can we do it like this:

if (CostMap.find(InnerLoop) != CostMap.end() &&
      CostMap.find(OuterLoop) != CostMap.end()) {
...
} else {
   Cost = getInstrOrderCost();
    LLVM_DEBUG(dbgs() << "Cost = " << Cost << "\n");
    if (Cost < -LoopInterchangeCostThreshold)
      return true;
}
1192

Can we unify the optimization remark output for both cost models? I don't think the specific cost values are of interest to the user, so just saying whether we found the loops profitable or not should suffice. For developers interested in the actual cost, they can get it through -debug output.

llvm/test/Transforms/LoopInterchange/call-instructions.ll
2

If we use powerpc64le-unknown-linux-gnu as the target triple, then we wouldn't need to specify -mcpu, since getCacheLineSize() returns valid cache line size for all Power subtargets.

congzhe updated this revision to Diff 431049.May 20 2022, 2:09 PM
congzhe updated this revision to Diff 431054.May 20 2022, 2:25 PM
llvm/lib/Transforms/Scalar/LoopInterchange.cpp
63

Thanks, that's a good point. I've removed this flag.

congzhe added inline comments.May 20 2022, 2:26 PM
llvm/lib/Transforms/Scalar/LoopInterchange.cpp
512

Thanks and updated accordingly.

1180

Thanks, I updated the code as you suggested.

llvm/test/Transforms/LoopInterchange/call-instructions.ll
2

I've now used powerpc64le-unknown-linux-gnu for all tests.

eopXD added a subscriber: eopXD.May 21 2022, 5:23 AM

There's some changes to lit tests. Most of them are minor. One change that applies to all tests is that I changed the target triple from "x86_64" to "aarch64" and added "-mcpu=tsv110" in the RUN lines.

IIUC, this means that IndexInner = CostMap.find(InnerLoop)->second = IndexOuter = CostMap.find(OuterLoop)->second = 0 here, right? Will your change possibly take away some coverage from x86? I think having a pre-commit to fill in something in x86's TTI to make sure its working fine on the new model would be fine. We can have multiple RUN lines on different architectures.

There's some changes to lit tests. Most of them are minor. One change that applies to all tests is that I changed the target triple from "x86_64" to "aarch64" and added "-mcpu=tsv110" in the RUN lines.

IIUC, this means that IndexInner = CostMap.find(InnerLoop)->second = IndexOuter = CostMap.find(OuterLoop)->second = 0 here, right?

Thanks for the comment!

With "x86_64" target triple since TTI->getCachelineSize() returns 0, loop cache analysis would execute analysis incorrectly, which means the loop vector it output may not be correct. What this indicates in loop interchange is that, IndexInner/IndexOuter might be messed up such that even if InnerLoop should be placed as the outer loop and OuterLoop should be placed as the inner loop, IndexInner might not be smaller than IndexOuter and hence we won't do interchange.

Will your change possibly take away some coverage from x86? I think having a pre-commit to fill in something in x86's TTI to make sure its working fine on the new model would be fine. We can have multiple RUN lines on different architectures.

IMHO I don't think we necessarily lose coverage for x86. Loop interchange is a mid-end pass that is not dependent on specific backend targets, although we need a little bit information from backends which is CachelineSize. Whichever "target triple" we use in our tests won't affect the functionality of this pass (and the test coverage). Regarding a pre-commit of "getCachelineSize()": I'm leaning towards leaving the implementation of getCachelineSize() in different backends to experts who are most familiar with that specific backend, let it be x86, arm, arc, loongArch, RISCV, etc, since it looks like implementation of "getCachelineSize()" is missing in a few targets. I'd appreciate it if you could let me know your thoughts.

congzhe added a comment.EditedMay 25 2022, 11:43 AM

For now we are failing one test case in LICM: Transforms/LICM/lnicm.ll. The reason is as follows. Looking at the source code:

void test(int x[10][10], int y[10], int *z) {
  for (int k = 0; k < 10; k++) {
    int tmp = *z;
    for (int i = 0; i < 10; i++)
      x[i][k] += y[k] + tmp;
  }
}

With the new cost model, loop cache analysis could analyze the access to array y but could not analyze the access to array x since it could not delinearize fixed-size array x. This is because the IR uses a chain of getelementptrs to access x which cannot be delinearized currently (as we discussed before). Hence loop cache analysis does the analysis only based on y and decides not to interchange. Ideally it should analyze both x and y and possibly decides to interchange the loops (whether to interchange or not depends on the tripcount numbers though). Currently this test case assumes we should interchange which we do not, and therefore the test fails.

There's multiple ways to resolve it. One possible way in my mind is that, we could change the following code in populateReferenceGroups() in LoopCacheAnalysis.cpp from:

if (!R->isValid())
    continue;

to

if (!R->isValid())
    return false;

which means that if it finds an IndexedReference that is not analyzable, it bails out with an empty loop vector and indicates it would not produce results if there's mem access that is not analyzable. IMHO it does make some sense because if loop cache analysis cannot successfully analyze all mem accesses, it should bail out otherwise incorrect decisions might be made due to missed info. In this case loop interchange will just fall back to the legacy cost model. If we go with this fix, I can post a simple patch for it.

Other possible fixes include modifying the gep instructions in Transforms/LICM/lnicm.ll to make mem accesses delinearizable and analyzable. I'd be fine with any possible solution. I'd appreciate it if you could let me know your thoughts? @bmahjour

bmahjour added a comment.EditedMay 26 2022, 1:53 PM

Other possible fixes include modifying the gep instructions in Transforms/LICM/lnicm.ll to make mem accesses delinearizable and analyzable. I'd be fine with any possible solution. I'd appreciate it if you could let me know your thoughts? @bmahjour

Overtime we'd want to improve delinearization and remove dependency on the legacy loop interchange cost model. I'd prefer changing the test case to become more friendly to delinearization. That test case needs to be rewritten soon to get rid of typed pointers anyway, so might as well use opaque pointers now and make the arrays non-fixed size.

congzhe updated this revision to Diff 432741.May 28 2022, 10:47 AM
congzhe edited the summary of this revision. (Show Details)

Updated the test case Transforms/LICM/lnicm.ll according to comments from Bardia @bmahjour, used opaque pointers and non-fixed size array. A few minor basic blocks were added in Transforms/LICM/lnicm.ll but the semantics did not change. Now delinearization works for it, and the new interchange cost model works fine - after running LNICM the loopnest will be interchanged.

bmahjour added inline comments.May 30 2022, 9:13 AM
llvm/lib/Transforms/Scalar/LoopInterchange.cpp
511

[nit] and populate the <Loop, index> pair into a map -> and put each <Loop, index> pair into a map...

516

Since CC can be nullptr we need to assert that that's not the case.

1188

[nit] IndexInner -> InnerIndex
[nit] IndexOuter -> OuterIndex

llvm/test/Transforms/LICM/lnicm.ll
6 ↗(On Diff #432741)

[nit] please add n and m as parameter. ie void test(int n, int m, int x[m][n], int y[n], int *z) {

20 ↗(On Diff #432741)

[nit] correct order based on the C code above.

bmahjour added inline comments.May 30 2022, 9:24 AM
llvm/test/Transforms/LICM/lnicm.ll
23 ↗(On Diff #432741)

Why can't this be kept as CHECK-NEXT (like before)? The load of z should be moved out by both LICM and LNICM I think.

congzhe updated this revision to Diff 433004.May 30 2022, 7:18 PM

Thanks Bardia for the comments, I've addressed them accordingly.

Regarding test/Transforms/LICM/lnicm.ll: the load of z is indeed moved out by both LICM and LNICM so there is no functional problems. The reason why the CHECK lines are changed here is that the new basic blocks I added in this patch (for ease of delinearization) will be involved in transformation of loop interchange, which is not the case before this patch.

With this patch:

  • With LICM the load of z is hoisted into entry and since loop interchange won't take place, the load of z stays in entry.
  • With LNICM the load of z is again hoisted into entry, and since entry is the preheader of the original outer loop, after interchange load of z is moved into for.body3.preheader, which is the new outer loop preheader. I tried to modify the IR aiming at keep the CHECK-NEXT lines that you mentioned like before, but I have not been successful.

Before this patch:

For the original test/Transforms/LICM/lnicm.ll before this patch, the entry block is not considered part of the loop (and a new outer preheader for.body.preheader will be generated) so once the load of z is hoisted to entry by either LICM or LNICM, it would always stay there since entry was not involved in transformation of loop interchange.

...With LNICM the load of z is again hoisted into entry, and since entry is the preheader of the original outer loop, after interchange load of z is moved into for.body3.preheader, which is the new outer loop preheader....before this patch, the entry block is not considered part of the loop ...

From what I can see, the "entry" is the preheader of the original outer loop in both cases, so not sure why it's treated differently. Maybe it's better to separate out the INTC, LNICM and LICM checks and show the full context for each set to better understand what's going on?

llvm/lib/Transforms/Scalar/LoopInterchange.cpp
517–518
1193–1194
congzhe updated this revision to Diff 433153.May 31 2022, 11:34 AM

...With LNICM the load of z is again hoisted into entry, and since entry is the preheader of the original outer loop, after interchange load of z is moved into for.body3.preheader, which is the new outer loop preheader....before this patch, the entry block is not considered part of the loop ...

From what I can see, the "entry" is the preheader of the original outer loop in both cases, so not sure why it's treated differently. Maybe it's better to separate out the INTC, LNICM and LICM checks and show the full context for each set to better understand what's going on?

I thought about too, it might be cleaner to separate out the INTC, LNICM and LICM checks. Patch is updated accordingly.

And thanks for finding out the typos :)

LGTM. Thanks!

bmahjour accepted this revision.May 31 2022, 12:47 PM
This revision is now accepted and ready to land.May 31 2022, 12:47 PM
This revision was automatically updated to reflect the committed changes.
kaz7 added a subscriber: kaz7.Jun 2 2022, 5:50 PM

Hi, after this patch, our buildbot for VE has been failing, https://lab.llvm.org/buildbot/#/builders/91/builds/9844. Is it possible to inspect these failures? Thanks.

congzhe added a comment.EditedJun 2 2022, 6:59 PM

Hi, after this patch, our buildbot for VE has been failing, https://lab.llvm.org/buildbot/#/builders/91/builds/9844. Is it possible to inspect these failures? Thanks.

Thanks for letting me know. On my local machine I am not seeing test failures. Would you please let me know which test and what command failed so I can look into it? The link above does not seem to contain information.

(Update) "The link above does not seem to contain information": this was my gateway problem and I can see the failures now. This is strange though because as I checked both before and after landing this patch, there is no test failures on my local machine (x86).

fhahn added a subscriber: fhahn.Jun 3 2022, 1:17 AM

Hi, after this patch, our buildbot for VE has been failing, https://lab.llvm.org/buildbot/#/builders/91/builds/9844. Is it possible to inspect these failures? Thanks.

Thanks for letting me know. On my local machine I am not seeing test failures. Would you please let me know which test and what command failed so I can look into it? The link above does not seem to contain information.

The issue is likely that all tests now depend on the PowerPC target to be built. This won't work as expected when LLVM is built with the PPC backend disabled. This is causing the test failures.

The tests as they are written at the moment would need something like ; REQUIRES: powerpc-registered-target to only run them when the target is available. If a lot of tests require a specific target to run, usually they are place in PowerPC sub-directory, which has a lit file limiting them to run only when the target is available.

Having all tests depend on a single target is not ideal, because this limits the number of bots/configurations that run the tests. I think we need a way to keep most of the tests independent of the PowerPC or any other target and only have tests that explicitly test the cost-modeling depend on the target.

fhahn added a comment.Jun 3 2022, 1:21 AM

interchanged-loop-nest-3.ll, not-interchanged-loop-nest-3.ll, not-interchanged-tightly-nested.ll: the IR was not really entirely correct, since the target triple specified was "x86_64" but the getelementptr indices are 32 bits. The indices should be 64 bits since pointer arithmetric should be 64 bit. So I changed them from i32 to i64, otherwise it will trigger SCEV assertion failures when running loop cache analysis, which says "scev operand types mismatch".

I don't think changing the IR here the right solution to resolving the crash. AFAICT it is perfectly legal to have GEPs with i32 operands even if the target has 64 bit pointers, they smaller arguments will be sign-extended if necessary (see getelementptr semantics in LangRef, which has If the offsets have a different width from the pointer, they are sign-extended or truncated to the width of the pointer.)

LoopCacheAnalysis should not crash on valid IR and needs fixing.

Having all tests depend on a single target is not ideal, because this limits the number of bots/configurations that run the tests. I think we need a way to keep most of the tests independent of the PowerPC or any other target and only have tests that explicitly test the cost-modeling depend on the target.

The analysis uses getCacheLineSize() which returns 0 by default unless a subtarget overrides it. I think value 0 is meant for machines with no cache model (in which case loop interchange will likely not matter). Ideally all other targets should implement this function, but for the time being two ways I can think of to solve this without making the tests target-dependent are:

  1. Define an option-controlled CacheLineSize inside LoopCacheAnalysis.cpp (with a default of say 64). If the option is specified we use the value supplied otherwise, we call getCacheLineSize() and use the value if it's non zero, or fall-back to the default value in case of zero.
  2. Halt the analysis and invalidate the result when getCacheLineSize() returns zero. Clients must check the validity of result before using it, or query if the analysis can be computed for the specified target. This looks undesirable to me, since it makes the analysis less consumable. Also there seem to be targets that can benefit from interchange but just haven't implemented getCacheLineSize() yet.

Having all tests depend on a single target is not ideal, because this limits the number of bots/configurations that run the tests. I think we need a way to keep most of the tests independent of the PowerPC or any other target and only have tests that explicitly test the cost-modeling depend on the target.

The analysis uses getCacheLineSize() which returns 0 by default unless a subtarget overrides it. I think value 0 is meant for machines with no cache model (in which case loop interchange will likely not matter). Ideally all other targets should implement this function, but for the time being two ways I can think of to solve this without making the tests target-dependent are:

  1. Define an option-controlled CacheLineSize inside LoopCacheAnalysis.cpp (with a default of say 64). If the option is specified we use the value supplied otherwise, we call getCacheLineSize() and use the value if it's non zero, or fall-back to the default value in case of zero.
  2. Halt the analysis and invalidate the result when getCacheLineSize() returns zero. Clients must check the validity of result before using it, or query if the analysis can be computed for the specified target. This looks undesirable to me, since it makes the analysis less consumable. Also there seem to be targets that can benefit from interchange but just haven't implemented getCacheLineSize() yet.

Thanks Florian for finding the cause, and thanks Bardia for your suggestions. On this thread I think I can try your solution #1 first. I remember I actually tried exactly this approach weeks ago internally and it worked (although I never posted it). Let me take a second look and if it does work fine, I'll update my patch accordingly and reset the target triple? For the ones that had x86 previously I can reset it to x86, and for the ones that do not have a target triple I can remove it.

Thanks @congzhe.

For the ones that had x86 previously I can reset it to x86, and for the ones that do not have a target triple I can remove it.

If there is no need for x86 triple, we should remove it from the tests as well.

BTW, I also agree with Florian's comment about GEPs with i32 operands, but I'd be ok to deal with it in a separate patch.

congzhe reopened this revision.Jun 6 2022, 10:51 AM
This revision is now accepted and ready to land.Jun 6 2022, 10:51 AM
congzhe updated this revision to Diff 434532.EditedJun 6 2022, 11:01 AM

Updated the patch following Bardia's suggestion: 1) used an option-controlled CacheLineSize inside LoopCacheAnalysis.cpp. If the option is specified we use the value supplied otherwise, we call getCacheLineSize() and use the value if it's non zero, or fall-back to the default value in case of zero. 2) Removed all target triple lines in all loop interchange tests.

I will address Florian's comment on GEPs with i32 operands in my next patch.

congzhe updated this revision to Diff 434538.Jun 6 2022, 11:07 AM

Hi Bardia @bmahjour , I'm wondering if you have comments on the most recent version? I'd appreciate it if you could let me know your thoughts.

bmahjour added a comment.EditedJun 8 2022, 8:43 AM

Some additional comments:

  1. Could you please add a test directly under llvm/test/Analysis/LoopCacheAnalysis (not under PowerPC) to make sure the analysis produces different but sane costs when a) neither target triple nor option specified and b) option is specified but with a value different from the default.
  2. I noticed some tests still have target datalayout. Do we need to specify target datalayout? If not please remove the target datalayout.
llvm/lib/Analysis/LoopCacheAnalysis.cpp
51–52 ↗(On Diff #434538)
fhahn added inline comments.Jun 8 2022, 10:57 AM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
51–52 ↗(On Diff #434538)

The name and description seems to imply this specifies the cache-line size in general, but here it only applies to the use in CacheCost. Should the setting be directly be applied in TTI::getCacheLineSize?

Also, the change to add the option should probably be landed separately, with a test in llvm/test/Analysis/LoopCacheAnalsysis

Thanks @bmahjour @fhahn, according to your comments I posted a separate preliminary patch D127342 that puts the cache line size option in TargetTransformInfo.cpp, and added a new test case under llvm/test/Analysis/LoopCacheAnalysis to make sure loop cache analysis produces different but sane costs. After landing D127342 I'll rebase the current patch on D127342.

Should the setting be directly be applied in TTI::getCacheLineSize

I would personally be in favor of such a change, however as I mentioned earlier the current default value of 0 might have been purposefully chosen to indicate a computer with no cache. For example I see this comment in MCSubtargetInfo.h:

/// Return the target cache line size in bytes.  By default, return
/// the line size for the bottom-most level of cache.  This provides
/// a more convenient interface for the common case where all cache
/// levels have the same line size.  Return zero if there is no
/// cache model.
///
virtual unsigned getCacheLineSize() const {
  Optional<unsigned> Size = getCacheLineSize(0);
  if (Size)
    return *Size;

  return 0;
}

I don't think any of the targets supported run without some level of memory caching, but we need to run it by the wider community to make sure there are no surprises. Maybe a separate TTI hook for checking if a cache model exists would be the best compromise?

congzhe added a comment.EditedJun 9 2022, 11:37 AM

Should the setting be directly be applied in TTI::getCacheLineSize

I would personally be in favor of such a change, however as I mentioned earlier the current default value of 0 might have been purposefully chosen to indicate a computer with no cache. For example I see this comment in MCSubtargetInfo.h:

/// Return the target cache line size in bytes.  By default, return
/// the line size for the bottom-most level of cache.  This provides
/// a more convenient interface for the common case where all cache
/// levels have the same line size.  Return zero if there is no
/// cache model.
///
virtual unsigned getCacheLineSize() const {
  Optional<unsigned> Size = getCacheLineSize(0);
  if (Size)
    return *Size;

  return 0;
}

I don't think any of the targets supported run without some level of memory caching, but we need to run it by the wider community to make sure there are no surprises. Maybe a separate TTI hook for checking if a cache model exists would be the best compromise?

Thanks for the comment, I see your concern. Would you like me to continue D127342, or would you like me to still keep the cache line size option under LoopCacheCost.cpp as we did in this patch? To avoid breaking computers with no cache, I could update CLS in LoopCacheCost.cpp to the following (I may need to change to wording a little bit) :

CLS = DefaultCacheLineSize.getNumOccurrences() > 0  ? DefaultCacheLineSize : TTI.getCacheLineSize()

and pass -cache-line-size=64 to all loop interchange tests. This way the buildbot test failures can be avoided, and in the meantime for a realworld computer with no cache it won't break.

Thanks for the comment, I see your concern. Would you like me to continue D127342, or would you like me to still keep the cache line size option under LoopCacheCost.cpp as we did in this patch? To avoid breaking computers with no cache, I could update CLS in LoopCacheCost.cpp to the following (I may need to change to wording a little bit) :

CLS = DefaultCacheLineSize.getNumOccurrences() > 0  ? DefaultCacheLineSize : TTI.getCacheLineSize()

and pass -cache-line-size=64 to all loop interchange tests. This way the buildbot test failures can be avoided, and in the meantime for a realworld computer with no cache it won't break.

I slightly prefer to continue with D127342, assuming you bring it up with the wider community (eg. through Discourse) and there are no objections. Not having cache is an exception (at best) rather than the norm, so it doesn't make much sense to me that the default assumes no cache.
If that doesn't pan out I'm ok with your second suggestion, but we'd also need to teach LoopCacheCost to handle CLS == 0 (otherwise it would divide by zero).

congzhe added a comment.EditedJun 9 2022, 1:16 PM

Thanks for the comment, I see your concern. Would you like me to continue D127342, or would you like me to still keep the cache line size option under LoopCacheCost.cpp as we did in this patch? To avoid breaking computers with no cache, I could update CLS in LoopCacheCost.cpp to the following (I may need to change to wording a little bit) :

CLS = DefaultCacheLineSize.getNumOccurrences() > 0  ? DefaultCacheLineSize : TTI.getCacheLineSize()

and pass -cache-line-size=64 to all loop interchange tests. This way the buildbot test failures can be avoided, and in the meantime for a realworld computer with no cache it won't break.

I slightly prefer to continue with D127342, assuming you bring it up with the wider community (eg. through Discourse) and there are no objections. Not having cache is an exception (at best) rather than the norm, so it doesn't make much sense to me that the default assumes no cache.
If that doesn't pan out I'm ok with your second suggestion, but we'd also need to teach LoopCacheCost to handle CLS == 0 (otherwise it would divide by zero).

Thanks for the suggestions, I'll bring it up through Discourse and if there are no objections I'll continue with D127342.

Regarding "divide by zero" it seems to me that if CLS==0 in LoopCacheCost then isConsecutive() will always be false, hence we will never calculate this equation (TripCount*Stride)/CLS. But anyways this seems to be a subtle point - I'll pursue D127342 for now.

Matt added a subscriber: Matt.Jun 10 2022, 3:19 PM
congzhe added a comment.EditedJun 13 2022, 10:06 PM

Based on the discussion in the Discourse (https://discourse.llvm.org/t/rfc-targettransforminfo-add-an-option-to-supply-cache-line-size-if-not-provided-by-the-target/63114) and D127342, I'm thinking that there might be two ways to move forward.

  1. We can modify TargetTransformInfo.cpp in D127342 as below, and pass "-cache-line-size=64" to the RUN lines for loop interchange tests. Since as pointed out in the Discourse, some embedded devices have no cache so we could not default the cache line size to 64 in TargetTransformInfo.cpp. Moreover, loop interchange is a mid-end pass anyways so it might be difficult to take each backend target into account, we might just want to pass "-cache-line-size=64" as a general setup to loop interchange tests to ensure it works as expected in the mid-end.
unsigned TargetTransformInfo::getCacheLineSize() const {
  return CacheLineSize.getNumOccurrences() > 0 
             ? CacheLineSize
             : TTIImpl->getCacheLineSize();
}
  1. We could mimic the tests in loop date prefetch, where the tests do depend on specific backend targets, and are put under different directories, i.e., test/Transforms/LoopDataPrefetch/AArch64/large-stride.ll, test/Transforms/LoopDataPrefetch/PowerPC/basic.ll. So we could keep the powperpc target triple in loop interchange tests, and put them under test/Transforms/LoopInterchange/PowerPC.

I'm wondering if you think any of the two approaches make sense? @bmahjour

congzhe updated this revision to Diff 438771.Jun 21 2022, 11:27 AM

After D127342 is merged, we could try to reland this patch, by providing -cache-line-size=64 for each loop interchange test case. I've updated the patch accordingly.

I did not remove the target datalayout lines for the moment. As Michael pointed out, it represents the pointer size and so on. I actually tried to remove those lines but I hit one test failure in phi-ordering.ll (a SCEV crash in loop cache analysis). As Florian mentioned earlier, there should be some bug within loop cache analysis, and I'll look into it in my next patch as promised. For now I'm thinking we could try to get this patch landed firstly.

Comments are appreciated :)

After D127342 is merged, we could try to reland this patch, by providing -cache-line-size=64 for each loop interchange test case. I've updated the patch accordingly.

I did not remove the target datalayout lines for the moment. As Michael pointed out, it represents the pointer size and so on. I actually tried to remove those lines but I hit one test failure in phi-ordering.ll (a SCEV crash in loop cache analysis). As Florian mentioned earlier, there should be some bug within loop cache analysis, and I'll look into it in my next patch as promised. For now I'm thinking we could try to get this patch landed firstly.

Comments are appreciated :)

Hi Bardia, I'm wondering if you have further comments on this most recent version? @bmahjour

bmahjour accepted this revision.Jun 23 2022, 10:20 AM

LGTM. Thanks!

This revision was landed with ongoing or failed builds.Jun 23 2022, 1:35 PM
This revision was automatically updated to reflect the committed changes.

This change causes ubsan failures due to integer overflow, see https://lab.llvm.org/buildbot/#/builders/5/builds/25185

/b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Analysis/LoopCacheAnalysis.cpp:702:30: runtime error: signed integer overflow: 6148914691236517209 * 100 cannot be represented in type 'long'
    #0 0x56504b89b41c in llvm::CacheCost::computeLoopCacheCost(llvm::Loop const&, llvm::SmallVector<llvm::SmallVector<std::__1::unique_ptr<llvm::IndexedReference, std::__1::default_delete<llvm::IndexedReference>>, 8u>, 8u> const&) const /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Analysis/LoopCacheAnalysis.cpp:702:30
    #1 0x56504b899bdb in llvm::CacheCost::calculateCacheFootprint() /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Analysis/LoopCacheAnalysis.cpp:598:28
    #2 0x56504b89a218 in std::__1::__unique_if<llvm::CacheCost>::__unique_single std::__1::make_unique<llvm::CacheCost, llvm::SmallVector<llvm::Loop*, 8u>&, llvm::LoopInfo&, llvm::ScalarEvolution&, llvm::TargetTransformInfo&, llvm::AAResults&, llvm::DependenceInfo&, llvm::Optional<unsigned int>&>(llvm::SmallVector<llvm::Loop*, 8u>&, llvm::LoopInfo&, llvm::ScalarEvolution&, llvm::TargetTransformInfo&, llvm::AAResults&, llvm::DependenceInfo&, llvm::Optional<unsigned int>&) /b/sanitizer-x86_64-linux-fast/build/libcxx_build_ubsan/include/c++/v1/__memory/unique_ptr.h:714:32
    #3 0x56504b899e74 in llvm::CacheCost::getCacheCost(llvm::Loop&, llvm::LoopStandardAnalysisResults&, llvm::DependenceInfo&, llvm::Optional<unsigned int>) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Analysis/LoopCacheAnalysis.cpp:583:10
    #4 0x56504b888b74 in llvm::LoopInterchangePass::run(llvm::LoopNest&, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>&, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/Scalar/LoopInterchange.cpp:1773:7
    #5 0x56504bf78ad1 in llvm::detail::PassModel<llvm::LoopNest, llvm::LoopInterchangePass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&>::run(llvm::LoopNest&, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>&, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/include/llvm/IR/PassManagerInternal.h:88:17
    #6 0x56504bffbbb5 in llvm::Optional<llvm::PreservedAnalyses> llvm::PassManager<llvm::Loop, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&>::runSinglePass<llvm::LoopNest, std::__1::unique_ptr<llvm::detail::PassConcept<llvm::LoopNest, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&>, std::__1::default_delete<llvm::detail::PassConcept<llvm::LoopNest, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&>>>>(llvm::LoopNest&, std::__1::unique_ptr<llvm::detail::PassConcept<llvm::LoopNest, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&>, std::__1::default_delete<llvm::detail::PassConcept<llvm::LoopNest, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&>>>&, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>&, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&, llvm::PassInstrumentation&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/include/llvm/Transforms/Scalar/LoopPassManager.h:398:16
    #7 0x56504bffb033 in llvm::PassManager<llvm::Loop, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&>::runWithLoopNestPasses(llvm::Loop&, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>&, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/Scalar/LoopPassManager.cpp:104:16
    #8 0x56504bffad4a in llvm::PassManager<llvm::Loop, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&>::run(llvm::Loop&, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>&, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/Scalar/LoopPassManager.cpp:31:32
    #9 0x56504bf28bb1 in llvm::detail::PassModel<llvm::Loop, llvm::PassManager<llvm::Loop, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&>, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&>::run(llvm::Loop&, llvm::AnalysisManager<llvm::Loop, llvm::LoopStandardAnalysisResults&>&, llvm::LoopStandardAnalysisResults&, llvm::LPMUpdater&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/include/llvm/IR/PassManagerInternal.h:88:17
    #10 0x56504bffc3c5 in llvm::FunctionToLoopPassAdaptor::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/Transforms/Scalar/LoopPassManager.cpp:297:22
    #11 0x56504bf66881 in llvm::detail::PassModel<llvm::Function, llvm::FunctionToLoopPassAdaptor, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/include/llvm/IR/PassManagerInternal.h:88:17
    #12 0x56504b190883 in llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/include/llvm/IR/PassManager.h:522:21
    #13 0x565049418591 in llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/include/llvm/IR/PassManagerInternal.h:88:17
    #14 0x56504b194f17 in llvm::ModuleToFunctionPassAdaptor::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/lib/IR/PassManager.cpp:127:22
    #15 0x565049418301 in llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/include/llvm/IR/PassManagerInternal.h:88:17
    #16 0x56504b18faa3 in llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/include/llvm/IR/PassManager.h:522:21
    #17 0x565048f05f7c in llvm::runPassPipeline(llvm::StringRef, llvm::Module&, llvm::TargetMachine*, llvm::TargetLibraryInfoImpl*, llvm::ToolOutputFile*, llvm::ToolOutputFile*, llvm::ToolOutputFile*, llvm::StringRef, llvm::ArrayRef<llvm::StringRef>, llvm::ArrayRef<llvm::PassPlugin>, llvm::opt_tool::OutputKind, llvm::opt_tool::VerifierKind, bool, bool, bool, bool, bool) /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/tools/opt/NewPMDriver.cpp:496:7
    #18 0x565048f21a18 in main /b/sanitizer-x86_64-linux-fast/build/llvm-project/llvm/tools/opt/opt.cpp:822:12
    #19 0x7fda3e3df09a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2409a) (BuildId: eb6a5dd378d22b1e695984462a799cd4c81cdc22)
    #20 0x565048ed7b79 in _start (/b/sanitizer-x86_64-linux-fast/build/llvm_build_ubsan/bin/opt+0x6368b79)

The cost overflow is a known issue (see https://github.com/llvm/llvm-project/issues/55233). I wonder if specifying a larger cache line size (eg. -cache-line-size=128 or 256 ) would workaround the ubsan failures for now. Otherwise we'd have to fix 55233 before recommitting this, I guess.

Won't that just make the tests pass, but keep potential integer overflows (UB) in the compiler? Sounds suboptimal.

The cost overflow is a known issue (see https://github.com/llvm/llvm-project/issues/55233). I wonder if specifying a larger cache line size (eg. -cache-line-size=128 or 256 ) would workaround the ubsan failures for now. Otherwise we'd have to fix 55233 before recommitting this, I guess.

The test case that fails upstream test, i.e., test/Transforms/LoopInterchange/interchangeable-innerloop-multiple-indvars.ll, is not supposed to result in large values in loop cache analysis calculation. The loop costs associated with those loops should be pretty small, and overflow is not supposed to occur. This is a bit strange, and on my local machine all tests did pass. We may need to dig deeper to find why it fails upstream build bot, nevertheless not being able to reproduce the failure has caused a bit of headache for me.

congzhe added a comment.EditedJun 24 2022, 11:44 AM

The cost overflow is a known issue (see https://github.com/llvm/llvm-project/issues/55233). I wonder if specifying a larger cache line size (eg. -cache-line-size=128 or 256 ) would workaround the ubsan failures for now. Otherwise we'd have to fix 55233 before recommitting this, I guess.

The test case that fails upstream test, i.e., test/Transforms/LoopInterchange/interchangeable-innerloop-multiple-indvars.ll, is not supposed to result in large values in loop cache analysis calculation. The loop costs associated with those loops should be pretty small, and overflow is not supposed to occur. This is a bit strange, and on my local machine all tests did pass. We may need to dig deeper to find why it fails upstream build bot, nevertheless not being able to reproduce the failure has caused a bit of headache for me.

Okay I missed the fact that this is a sanitizer test, it makes more sense to fail the test now since potentially loop cache analylsis could overflow. But I'm still not entirely clear - if loop cache analysis suffers from potential overflow it should have failed the tests in test/LoopCacheAnalysis/ quite early on at the first place. Why did test/LoopCacheAnalysis/ not fail the sanitizer tests before but test/LoopInterchange/ tests are failing now?

The cost overflow is a known issue (see https://github.com/llvm/llvm-project/issues/55233). I wonder if specifying a larger cache line size (eg. -cache-line-size=128 or 256 ) would workaround the ubsan failures for now. Otherwise we'd have to fix 55233 before recommitting this, I guess.

The test case that fails upstream test, i.e., test/Transforms/LoopInterchange/interchangeable-innerloop-multiple-indvars.ll, is not supposed to result in large values in loop cache analysis calculation. The loop costs associated with those loops should be pretty small, and overflow is not supposed to occur. This is a bit strange, and on my local machine all tests did pass. We may need to dig deeper to find why it fails upstream build bot, nevertheless not being able to reproduce the failure has caused a bit of headache for me.

Okay I missed the fact that this is a sanitizer test, it makes more sense to fail the test now since potentially loop cache analylsis could overflow. But I'm still not entirely clear - if loop cache analysis suffers from potential overflow it should have failed the tests in test/LoopCacheAnalysis/ quite early on at the first place. Why did test/LoopCacheAnalysis/ not fail the sanitizer tests before but test/LoopInterchange/ tests are failing now?

I find the problem - it is line 185 in interchangeable-innerloop-multiple-indvars.ll, which is %tobool2 = icmp eq i64 %indvars.add, 0. This actually caused the loop to be infinite since %indvars.add will never be 0. The trip count in this case would be a very large number which caused the overlfow. I've modified this line to be %tobool2 = icmp sle i64 %indvars.add, 0, which resolves the problem. I guess let me fix it and do a sanitizer check again before I re-commit it.

Yep. Ubsan does not detect potential overflows, it detects actual overflows as they happen. The problem is real.

congzhe updated this revision to Diff 440465.EditedJun 27 2022, 9:03 PM

Fixed the IR in interchangeable-innerloop-multiple-indvars.ll, will do another attempt to land.

Update: landed, hopefully will get through this time.

This revision was landed with ongoing or failed builds.Jun 27 2022, 9:09 PM

Hello,

We stumbled upon a case where loop-interchange seems to start hanging and consume more and more memory until it cannot do that any longer with this patch:

opt -passes="loop(loop-interchange,loop-interchange)" bbi-72548.ll -o /dev/null

Hello,

We stumbled upon a case where loop-interchange seems to start hanging and consume more and more memory until it cannot do that any longer with this patch:

opt -passes="loop(loop-interchange,loop-interchange)" bbi-72548.ll -o /dev/null

Hi Mikael, thanks for this. I've quickly looked at it and the problem can be avoided by running opt -passes="loop(loop-interchange),loop(loop-interchange))" bbi-72548.ll -o /dev/null using the new pass manager, or opt -loop-interchange -loop-interchange" bbi-72548.ll -o /dev/null using the legacy pass manager.

My impression is that the root cause is not within this patch but I suspect it is a problem with the loopnest pass pipeline within the new pass manager. This IR is a triply-nested loop, after the first interchange pass, it should still be a triply-nested loop. However, when running opt -passes="loop(loop-interchange,loop-interchange), what is populated into the loopnest data structure in the LoopInterchangePass::run() function at the beginning of the 2nd interchange pass is a 2-level (doubly-nested) loop. This is incorrect and caused the trouble (infinitely looping over the following code in LoopInterchange.cpp).

while (Dep.size() != Level) {
  Dep.push_back('I');
}

Seems like the loopnest pass pipeline does not get the loop correct, when the loop is modified during the pipeline.

When running opt -passes="loop(loop-interchange),loop(loop-interchange))" bbi-72548.ll -o /dev/null or opt -loop-interchange -loop-interchange" bbi-72548.ll -o /dev/null, what is populated into the loopnest data structure in the LoopInterchangePass::run() function at the beginning of the 2nd interchange pass is a 3-level (triply-nested) loop, which is correct.

I'll do more investigations and post updates here.

I wrote an issue about a failed assertion we started seeing with this patch:
https://github.com/llvm/llvm-project/issues/57148

Note: Both the hanging and the new crash have popped up during fuzz testing with nonstandard pipelines.

I wrote an issue about a failed assertion we started seeing with this patch:
https://github.com/llvm/llvm-project/issues/57148

Note: Both the hanging and the new crash have popped up during fuzz testing with nonstandard pipelines.

Thanks, I've posted D132055 to fix pr57148.

Hello,

We stumbled upon a case where loop-interchange seems to start hanging and consume more and more memory until it cannot do that any longer with this patch:

opt -passes="loop(loop-interchange,loop-interchange)" bbi-72548.ll -o /dev/null

Hi Mikael, thanks for this. I've quickly looked at it and the problem can be avoided by running opt -passes="loop(loop-interchange),loop(loop-interchange))" bbi-72548.ll -o /dev/null using the new pass manager, or opt -loop-interchange -loop-interchange" bbi-72548.ll -o /dev/null using the legacy pass manager.

My impression is that the root cause is not within this patch but I suspect it is a problem with the loopnest pass pipeline within the new pass manager. This IR is a triply-nested loop, after the first interchange pass, it should still be a triply-nested loop. However, when running opt -passes="loop(loop-interchange,loop-interchange), what is populated into the loopnest data structure in the LoopInterchangePass::run() function at the beginning of the 2nd interchange pass is a 2-level (doubly-nested) loop. This is incorrect and caused the trouble (infinitely looping over the following code in LoopInterchange.cpp).

while (Dep.size() != Level) {
  Dep.push_back('I');
}

Seems like the loopnest pass pipeline does not get the loop correct, when the loop is modified during the pipeline.

When running opt -passes="loop(loop-interchange),loop(loop-interchange))" bbi-72548.ll -o /dev/null or opt -loop-interchange -loop-interchange" bbi-72548.ll -o /dev/null, what is populated into the loopnest data structure in the LoopInterchangePass::run() function at the beginning of the 2nd interchange pass is a 3-level (triply-nested) loop, which is correct.

I'll do more investigations and post updates here.

I've posted D132199 to fix the hanging problem.