Page MenuHomePhabricator

[LoopIdiomRecognizePass] Options to disable part or the entire Loop Idiom Recognize Pass
ClosedPublic

Authored by anhtuyen on Aug 19 2020, 9:29 PM.

Details

Summary

Loop Idiom Recognize Pass (LIRP) attempts to transform loops with subscripted arrays into memcpy/memset function calls. In some particular situation, this transformation introduces negative impacts.

This patch is to provide the user with options to disable part or the entire Loop Idiom Recognize Pass. The default behavior stays unchanged: no part of LIRP is disabled by default. The options will enable users to disable a particular part of the transformation, while he/she can still enjoy the benefit brought about by the rest of LIRP.

Diff Detail

Event Timeline

anhtuyen created this revision.Aug 19 2020, 9:29 PM
anhtuyen requested review of this revision.Aug 19 2020, 9:29 PM
Eugene.Zelenko added inline comments.Aug 19 2020, 9:50 PM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
239

Unnecessary empty line.

anhtuyen updated this revision to Diff 286710.Aug 19 2020, 10:20 PM

Remove an empty line based on reviewer's comment.

anhtuyen marked an inline comment as done.Aug 19 2020, 10:22 PM
anhtuyen added inline comments.
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
239

Thanks, the blank line has been removed.

Why is this the correct course of action?
For example, if the IR already had memcpy/memset, the DA will also be pessimized,
so it really seems like a workaround, not a fix.
Why not just enhance DA?

Why is this the correct course of action?
For example, if the IR already had memcpy/memset, the DA will also be pessimized,
so it really seems like a workaround, not a fix.
Why not just enhance DA?

I agree with that, it seems to be better to improve DA. Is it feasible?

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
111–114

Please prefer single-line comments in such cases.

239

You can omit .getValue() here and below.

fhahn added a subscriber: fhahn.Aug 20 2020, 1:14 AM

I agree that the justification provided seems somewhat inadequate - using memset/memcpy as a canonical form and teaching DA to handle it seems quite reasonable. However, there are libraries that are highly tuned wrt. when to call these functions and when not to. So it seems to me perfectly reasonable to provide an option to disable this for such uses. For example, a function may perform a memcpy/memset operation on data that is known to only be called only from range-limited sites. The compiler (without PGO+LTO which are often not an option) cannot know that it should expand such calls in the back end.

However, I am not sure I agree with this being the best approach for the implementation. Presumably (now or in the future) there may be other places in this pass that produce memset/memcpy. Also, we will unnecessarily collect stores in in collectStores(). So it seems to me like it might be more advantageous to have these options prevent isLegalStore() from returning the corresponding store type.

rzurob added a subscriber: rzurob.Aug 20 2020, 4:14 AM
rzurob added inline comments.
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
117

Based on this option, if a user doesn't want loops to be converted to memcpy or memset, they should specify --disable-loop-idiom=all. But what if another idiom is added to the LoopIdiomRecognize pass in the future? --disable-loop-idiom=all will presumably disable more than what the user asked for.

Would it be possible to handle multiple --disable-loop-idiom=<something> options cumulatively? If not, does it make sense to just provide one option to disable the memcpy+memset transformation (i.e. both or neither) instead of this option with all/none/memcpy/memset fine-grain control?

fhahn added a comment.Aug 20 2020, 4:22 AM

I agree that the justification provided seems somewhat inadequate - using memset/memcpy as a canonical form and teaching DA to handle it seems quite reasonable. However, there are libraries that are highly tuned wrt. when to call these functions and when not to. So it seems to me perfectly reasonable to provide an option to disable this for such uses. For example, a function may perform a memcpy/memset operation on data that is known to only be called only from range-limited sites. The compiler (without PGO+LTO which are often not an option) cannot know that it should expand such calls in the back end.

The issue you raise seems to be a cost-modeling issue, right? Ideally the compiler would know whether it is faster to use a loop or a memcpy. IIUC you are worried about the cases where the compiler replaces a loop with a memset, but the loop would be faster? Is there a way to improve the cost-modeling? IMO adding options like this potentially leads to papering over cost-modeling issues, rather than addressing the underlying issue. Also, such options tend to now work well together with LTO.

Presumably (now or in the future) there may be other places in this pass that produce memset/memcpy. Also, we will unnecessarily collect stores in in collectStores(). So it seems to me like it might be more advantageous to have these options prevent isLegalStore() from returning the corresponding store type.

I agree with your observation. I will change the location (if we end up with disabling memset/memcpy).

Why not just enhance DA?

I agree with that, it seems to be better to improve DA. Is it feasible?

I agree that DA can be improved, but I hope that the example below from @nemanjai about libraries, which are highly tuned with respect to when/whether to call memset/memcpy functions provides another reason for this proposed change.

Why not just enhance DA?

I agree with that, it seems to be better to improve DA. Is it feasible?

I agree that DA can be improved, but I hope that the example below from @nemanjai about libraries, which are highly tuned with respect to when/whether to call memset/memcpy functions provides another reason for this proposed change.

I'm not really sold on this. We have similar transform in InstCombine.
If the original problem is with DA, then that is what should be improved.

If this transform is a pessimization for some code, it would be good to actually see that code.

anhtuyen added inline comments.Aug 20 2020, 7:21 AM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
111–114

Thank you! I will change the style of comments.

117

You raised a good point, thank you @rzurob. Let me re-think about that.

239

Thank you @dfukalov . I will remove .getValue() .

bmahjour added a comment.EditedAug 20 2020, 7:21 AM

I agree with that, it seems to be better to improve DA. Is it feasible?

The theory of data dependence analysis relies on presence of subscripts in array references to be able to produce accurate results. I don't see how we can "improve DA" to address memset/memcpies short of turning them back into loop nests before applying the dependence tests. To do that the loop has to either be materialized before the DA analysis pass is run, or somehow SCEV expressions representing the implied subscripts be synthesized out of thin air. The former must be achieved by a transformation pass, so we would have to turn memset/memcpys into loop nests as soon as possible. For memset/memcpy calls generated by the loop idiom pass, the ideal place for that transformation would be immediately after loop idiom itself, which would have the same effect as preventing loop idiom from creating such loops in the first place when it knows they are not profitable. I don't know of any possible way to do the latter.

I agree with @fhahn that this is more of a cost-modeling issue. I think the cost-modeling would have to rely heavily on loop tripcount data which, in the general case, is only available through PGO, so an option to disable it for users who don't want to use PGO makes sense to me.

Another way that loop idiom can hurt performance is by creating imperfect loop nests. Certain transformations (such as loop interchange and unroll-and-jam) are more difficult (sometimes impossible) to do when loop nests are not perfect.

anhtuyen added a comment.EditedAug 20 2020, 7:57 AM

Why not just enhance DA?

I agree with that, it seems to be better to improve DA. Is it feasible?

I agree that DA can be improved, but I hope that the example below from @nemanjai about libraries, which are highly tuned with respect to when/whether to call memset/memcpy functions provides another reason for this proposed change.

I'm not really sold on this. We have similar transform in InstCombine.
If the original problem is with DA, then that is what should be improved.

If this transform is a pessimization for some code, it would be good to actually see that code.

I hope you find the detailed comment below by @bmahjour (thanks, @bmahjour) a reasonable ground for this change.

The theory of data dependence analysis relies on presence of subscripts in array references to be able to produce accurate results. I don't see how we can "improve DA" to address memset/memcpies short of turning them back into loop nests before applying the dependence tests. To do that the loop has to either be materialized before the DA analysis pass is run, or somehow SCEV expressions representing the implied subscripts be synthesized out of thin air. The former must be achieved by a transformation pass, so we would have to turn memset/memcpys into loop nests as soon as possible. For memset/memcpy calls generated by the loop idiom pass, the ideal place for that transformation would be immediately after loop idiom itself, which would have the same effect as preventing loop idiom from creating such loops in the first place when it knows they are not profitable. I don't know of any possible way to do the latter.

I agree with @fhahn that this is more of a cost-modeling issue. I think the cost-modeling would have to rely heavily on loop tripcount data which, in the general case, is only available through PGO, so an option to disable it for users who don't want to use PGO makes sense to me.

fhahn added a comment.Aug 20 2020, 8:33 AM

I agree with that, it seems to be better to improve DA. Is it feasible?

The theory of data dependence analysis relies on presence of subscripts in array references to be able to produce accurate results. I don't see how we can "improve DA" to address memset/memcpies short of turning them back into loop nests before applying the dependence tests. To do that the loop has to either be materialized before the DA analysis pass is run, or somehow SCEV expressions representing the implied subscripts be synthesized out of thin air. The former must be achieved by a transformation pass, so we would have to turn memset/memcpys into loop nests as soon as possible. For memset/memcpy calls generated by the loop idiom pass, the ideal place for that transformation would be immediately after loop idiom itself, which would have the same effect as preventing loop idiom from creating such loops in the first place when it knows they are not profitable. I don't know of any possible way to do the latter.

IIUC LoopIdiom will effectively remove a loop and replace it with a memset/memcopy. So we should have the same information, just in different forms: loop that writes successive memory locations or a single call that we know writes to the same locations. I think it would be good to have a concrete motivating example that highlights what exactly goes wrong.

I agree with @fhahn that this is more of a cost-modeling issue. I think the cost-modeling would have to rely heavily on loop tripcount data which, in the general case, is only available through PGO, so an option to disable it for users who don't want to use PGO makes sense to me.

I don't think the pass creating memset/memcpy that are not profitable is a problem only for 'highly tuned libraries'. It is a problem for any code. I would argue if we cannot prove that it is likely to be profitable to optimize. So if we do not know the trip count (or do not have a good estimate), we should not create memsets/memcpys. Again, a concrete motivating example would be helpful.

... a concrete motivating example would be helpful.

I will create an example which I can share it here (without violating the non-disclosure terms on my side).

I agree that the justification provided seems somewhat inadequate - using memset/memcpy as a canonical form and teaching DA to handle it seems quite reasonable. However, there are libraries that are highly tuned wrt. when to call these functions and when not to. So it seems to me perfectly reasonable to provide an option to disable this for such uses. For example, a function may perform a memcpy/memset operation on data that is known to only be called only from range-limited sites. The compiler (without PGO+LTO which are often not an option) cannot know that it should expand such calls in the back end.

The issue you raise seems to be a cost-modeling issue, right? Ideally the compiler would know whether it is faster to use a loop or a memcpy. IIUC you are worried about the cases where the compiler replaces a loop with a memset, but the loop would be faster? Is there a way to improve the cost-modeling? IMO adding options like this potentially leads to papering over cost-modeling issues, rather than addressing the underlying issue. Also, such options tend to now work well together with LTO.

I suppose we can certainly treat this (at least in part) as a cost modeling issue - in that AFAICT, LIR does not do any cost modeling. If it is able to transform a loop into a memcpy, it does. If memcpy is the canonical form, this is OK as a canonicalization pass. However, given that we don't really get to expand these things into loops later on, it is a problem.

For memcmp, we added code (3a7578c6589b910f9a04bae7f7f121dfe3281578) to expand them that ultimately got moved out into a separate pass (063bed9baff63a0d716a5c9533cf2601dafbe0e0). I don't really remember the details of how much handling it has for non-constant lengths but presumably it does (or at least we have an obvious place to add it).
However, I do not think we have a similar pass for memcpy and memset. When we get into the SDAG, we have TargetLowering::findOptimalMemOpLowering() which can only handle constant lengths and presumably cannot expand into loops (being basic block local and all).

When we combine all of the above, code like this ends up with a call to memcpy:

void smallcp(unsigned *__restrict a, unsigned *__restrict b, unsigned len) {
  for (unsigned i = 0; i < (len % 4); i++)
    a[i] = b[i];
}

And my comment regarding "highly tuned libraries" was really meant to suggest a situation where for example, the rem is on the caller side in a different module so even a perfect cost model wouldn't help (well without LTO anyway).

To avoid any issue with NDA, I wrote a simple test as follows. This test program tries to prove the fact that, the impact of Loop Idiom Recognize Pass (LIGP)’s replacing store with memset is not always a positive one.

  1. Compiler work.c to an IR file called work.ll. We will use this IR file work.ll for both LIRP and LIRP --disable-loop-idiom=memset
  2. Call opt with -loop-idiom to produce an IR file containing memset instruction
opt -basic-aa -loop-idiom -S work.ll -o work.yes.ll
  1. Call opt with -loop-idiom --disable-loop-idiom=memset to produce an IR file without memset instruction
opt -basic-aa -loop-idiom --disable-loop-idiom=memset -S work.ll -o work.no.ll
  1. Inspect to make sure LIRP did replace store with memset in work.yes.ll, but not in work.no.ll
  2. Compile the test.c, and link it with the IR from (2) and then the IR from (3).
clang++ -c test.c
clang++ test.o work.yes.ll -o yes
clang++ test.o work.no.ll -o no
  1. Run both the executables on a quiet machine. On my performance machine, times spent are:

With memset: Time elapsed: 1.4215
Without memset: Time elapsed: 1.3611

test.c

#include <stdio.h>
#include <time.h>

int work(int A[], int sizeI, int sizeL);

int main() {
  int A[3] = {1, 2, 3};
  int res = 1;
  clock_t begin = clock();
  res = work(A, 9999, 3);
  clock_t end = clock();
  double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;

  printf("Time elapsed: %4.4f\n", time_spent);
  return res;
}

work.c

int work(int A[], int sizeI, int arraySize) {

    for (int i = 0; i < sizeI; ++i)
      for (int j = 0; j < sizeI; ++j)
        for (int k = 0; k < arraySize; ++k)
          A[k] = 0;

  return A[arraySize - 1];
}

IR before calling opt with -loop-idiom

; ModuleID = './work.ll'
source_filename = "work.c"
target datalayout = "e-m:e-i64:64-n32:64"
target triple = "powerpc64le-unknown-linux-gnu"

; Function Attrs: noinline nounwind
define dso_local signext i32 @_Z4workPiii(i32* %A, i32 signext %sizeI, i32 signext %arraySize)  #0 {
entry:
  %cmp6 = icmp slt i32 0, %sizeI
  br i1 %cmp6, label %for.body.preheader, label %for.end12

for.body.preheader:                               ; preds = %entry
  br label %for.body

for.body:                                         ; preds = %for.body.preheader, %for.inc10
  %i.07 = phi i32 [ %inc11, %for.inc10 ], [ 0, %for.body.preheader ]
  %cmp23 = icmp slt i32 0, %sizeI
  br i1 %cmp23, label %for.body3.preheader, label %for.inc10

for.body3.preheader:                              ; preds = %for.body
  br label %for.body3

for.body3:                                        ; preds = %for.body3.preheader, %for.inc7
  %j.04 = phi i32 [ %inc8, %for.inc7 ], [ 0, %for.body3.preheader ]
  %cmp51 = icmp slt i32 0, %arraySize
  br i1 %cmp51, label %for.body6.preheader, label %for.inc7

for.body6.preheader:                              ; preds = %for.body3
  br label %for.body6

for.body6:                                        ; preds = %for.body6.preheader, %for.body6
  %k.02 = phi i32 [ %inc, %for.body6 ], [ 0, %for.body6.preheader ]
  %idxprom = sext i32 %k.02 to i64
  %arrayidx = getelementptr inbounds i32, i32* %A, i64 %idxprom
  store i32 0, i32* %arrayidx, align 4
  %inc = add nsw i32 %k.02, 1
  %cmp5 = icmp slt i32 %inc, %arraySize
  br i1 %cmp5, label %for.body6, label %for.inc7.loopexit

for.inc7.loopexit:                                ; preds = %for.body6
  br label %for.inc7

for.inc7:                                         ; preds = %for.inc7.loopexit, %for.body3
  %inc8 = add nsw i32 %j.04, 1
  %cmp2 = icmp slt i32 %inc8, %sizeI
  br i1 %cmp2, label %for.body3, label %for.inc10.loopexit

for.inc10.loopexit:                               ; preds = %for.inc7
  br label %for.inc10

for.inc10:                                        ; preds = %for.inc10.loopexit, %for.body
  %inc11 = add nsw i32 %i.07, 1
  %cmp = icmp slt i32 %inc11, %sizeI
  br i1 %cmp, label %for.body, label %for.end12.loopexit

for.end12.loopexit:                               ; preds = %for.inc10
  br label %for.end12

for.end12:                                        ; preds = %for.end12.loopexit, %entry
  %sub = sub nsw i32 %arraySize, 1
  %idxprom13 = sext i32 %sub to i64
  %arrayidx14 = getelementptr inbounds i32, i32* %A, i64 %idxprom13
  %0 = load i32, i32* %arrayidx14, align 4
  ret i32 %0
}

attributes #0 = { noinline nounwind "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="pwr9" "target-features"="+altivec,+bpermd,+crypto,+direct-move,+extdiv,+htm,+power8-vector,+power9-vector,+vsx,-spe" "unsafe-fp-math"="false" "use-soft-float"="false"
 }

Extract from IR with memset

for.body6.preheader:                              ; preds = %for.body3
  call void @llvm.memset.p0i8.i64(i8* align 4 %A1, i8 0, i64 %1, i1 false)
  br label %for.body6

for.body6:                                        ; preds = %for.body6, %for.body6.preheader
  %k.02 = phi i32 [ %inc, %for.body6 ], [ 0, %for.body6.preheader ]
  %idxprom = sext i32 %k.02 to i64
  %arrayidx = getelementptr inbounds i32, i32* %A, i64 %idxprom
  %inc = add nsw i32 %k.02, 1
  %cmp5 = icmp slt i32 %inc, %arraySize
  br i1 %cmp5, label %for.body6, label %for.inc7.loopexit

Extract from IR without memset

for.body6.preheader:                              ; preds = %for.body3
  br label %for.body6

for.body6:                                        ; preds = %for.body6, %for.body6.preheader
  %k.02 = phi i32 [ %inc, %for.body6 ], [ 0, %for.body6.preheader ]
  %idxprom = sext i32 %k.02 to i64
  %arrayidx = getelementptr inbounds i32, i32* %A, i64 %idxprom
  store i32 0, i32* %arrayidx, align 4
  %inc = add nsw i32 %k.02, 1
  %cmp5 = icmp slt i32 %inc, %arraySize
  br i1 %cmp5, label %for.body6, label %for.inc7.loopexit
lebedev.ri requested changes to this revision.Aug 21 2020, 1:58 PM

I'd like to see a PhaseOrdering test showing the IR that is not optimized due to the DA being unaware about these intrinsics.

This revision now requires changes to proceed.Aug 21 2020, 1:58 PM

I'd like to see a PhaseOrdering test showing the IR that is not optimized due to the DA being unaware about these intrinsics.

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
117

Based on this option, if a user doesn't want loops to be converted to memcpy or memset, they should specify --disable-loop-idiom=all. But what if another idiom is added to the LoopIdiomRecognize pass in the future? --disable-loop-idiom=all will presumably disable more than what the user asked for.

Would it be possible to handle multiple --disable-loop-idiom=<something> options cumulatively? If not, does it make sense to just provide one option to disable the memcpy+memset transformation (i.e. both or neither) instead of this option with all/none/memcpy/memset fine-grain control?

I agree and also think is overkill to provide options to disable only part of the transformation, at least at this point. IMO is sufficient to start with a all or nothing option to disable the entire transformation (of course the transformation should continue to be enabled by default).

etiotto accepted this revision.EditedAug 24 2020, 7:01 AM

Transforming a loop into a memset or memcpy is not *always* profitable (it depends on how many elements are initialized/copied and on the efficiency of the target architecture implementation for those libraries) but is often better than a loop. The low level optimizer should change short memset/memcpy back into a sequence of assignments, IMO this should not be done in opt because the exact length for which memset is less profitable than individual assignment is a function of the target architecture. As for this PR, adding an option to disable the optimization is a good thing as it provides more flexibility to users that for whatever reason do not want the transformation to run (and is also handy for compiler developers when debugging code). And more flexibility is a good thing.

Here are a couple of programs to test the performance of a simple memset vs a simple initialization loop. On my system (PPC) even this short init. loop is slower than the memset.

% cat loop.c
#include <string.h>
int main() {
  int A[N];
  for (int n=0; n<STEPS; ++n)
    for(int i=0;i<N;++i)
      A[i] = 0;
  return A[0];
}

% gcc -O0 loop.c -DN=10 -DSTEPS=1000000; time ./a.out
./a.out 0.10s user 0.00s system 99% cpu 0.099 total

% cat memset.c
#include <string.h>
int main() {
  int A[N];
  for (int n=0; n<STEPS; ++n)
    memset(A, 0, N * sizeof(int));
  return A[0];
}

% gcc -O0 memset.c -DN=10 -DSTEPS=1000000; time ./a.out
./a.out 0.02s user 0.00s system 99% cpu 0.022 total

anhtuyen updated this revision to Diff 287413.Aug 24 2020, 8:56 AM

Address comments by the reviewers.

I'd like to see a PhaseOrdering test showing the IR that is not optimized due to the DA being unaware about these intrinsics.

It is not easy to provide a test case showing that the DA is confused after memset/memcpy function calls are used. For the purpose of this patch, Ettore (thanks @etiotto) and I have shown a few simple testcases where the performance is negatively impacted by the existence of memset (which was inserted by LIRP). I hope you consider it sufficient for the patch to be adopted.

fhahn added a comment.Aug 24 2020, 9:07 AM

Thanks for sharing the examples, IIUC they illustrate the issue with the cost modeling, which is the main motivation for the patch, rather than DA issues mentioned initially?

Transforming a loop into a memset or memcpy is not always profitable (it depends on how many elements are initialized/copied and on the efficiency of the target architecture implementation for those libraries). The low level optimizer should change short memset/memcpy back into a sequence of assignments, IMO this should not be done in opt because the exact length for which memset is less profitable than individual assignment is a function of the target architecture. As for this PR, adding an option to disable the optimization is a good thing as it provides more flexibility to users that for whatever reason do not want the transformation to run (and is also handy for compiler developers when debugging code). And more flexibility is a good thing.

The option to disable the pass seems indeed convenient in the short term, but I am not sure if it is really helpful in the long run. It does not address the underlying issue (bad/non-existing cost model) and it means we generates sub-optimal code in some cases for the vast majority of users (which will neither know nor set this special flag). Also, will the option work with LTO? I think Clang usually does not pass -mllvm options to the linker/LTO plugin.

I think such an option is fine as a temporary stop-gap solution, but the goal should be to fix the underlying issue IMO. Otherwise I am worried that the option reduces the incentive to fix the cost-modeling (which should be a fix-able issue).

It looks like there are some un-addressed comments with respect to the exact way to disable the transformation. If there's a single option to disable the whole pass, it might be worth considering to not add it to the pipeline, rather than bailing out in the pass. Also, it would be good to update the description of the patch with the actual motivation.

anhtuyen added inline comments.Aug 24 2020, 9:08 AM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
117

Thank you very much Rafik @rzurob and Ettore @etiotto for your opinion. I have switched from cl::opt to cl::bits, which enabled us to handle multiple --disable-loop-idiom=<something> options cumulatively. I hope that this approach addressed your concerns.

anhtuyen edited the summary of this revision. (Show Details)Aug 24 2020, 9:13 AM

I removed the details about DA being confused by the insertion of memset/memcpy by LIRP in the description.

lebedev.ri requested changes to this revision.Aug 24 2020, 9:31 AM
This revision now requires changes to proceed.Aug 24 2020, 9:31 AM
anhtuyen added a comment.EditedAug 24 2020, 9:52 AM

Hi @lebedev.ri
I saw you had again requested changes to this revision, but I could not find what changes that you thought should be required. Can you please be specific?

Hi @lebedev.ri
I saw you had again requested changes to this revision, but I could not find what changes that you thought should be required. Can you please be specific?

I'd like to better understand the roadmap here.
Initially the patch was submitted as a bandaid for DA, but then it was, taken over?, by general performance concerns.

  1. Is there or is there no a DA problem for these intrinsics?
    1. Is there a bugreport?
    2. Is there a testcase?
    3. If the patch proceeds as-is, what's long-term plan? Is anyone planning on addressing the underlying problem?
  2. For the transform itself
    1. Is there a bugreport
    2. Is there a testcase? (i can see it in comments, but it should be in bugreport)
    3. Is anyone planning on fixing this in more long-term way, with costmodelling and whatnot?

In none of those cases is such a flag a reasonable long-term solution.

@lebedev.ri

I'd like to see a PhaseOrdering test showing the IR that is not optimized due to the DA being unaware about these intrinsics.

Here's a test that shows a memset created by loop-idiom can prevent loop interchange because the dependence vectors are pessimized:

; ModuleID = 'interchange.ll'
source_filename = "interchange.c"
target datalayout = "e-m:e-i64:64-n32:64"
target triple = "powerpc64le-unknown-linux-gnu"

; Function Attrs: nounwind
define dso_local void @foo([1024 x i32]* %a, [1024 x i32]* %b, [1024 x [1024 x i32]]* %c, i32 signext %n) #0 {
entry:
  br label %for.body

for.body:                                         ; preds = %entry, %for.inc16
  %indvars.iv7 = phi i64 [ 0, %entry ], [ %indvars.iv.next8, %for.inc16 ]
  br label %for.body4

for.body4:                                        ; preds = %for.body, %for.inc13
  %indvars.iv4 = phi i64 [ 0, %for.body ], [ %indvars.iv.next5, %for.inc13 ]
  br label %for.body8

for.body8:                                        ; preds = %for.body4, %for.body8
  %indvars.iv = phi i64 [ 0, %for.body4 ], [ %indvars.iv.next, %for.body8 ]
  %arrayidx12 = getelementptr inbounds [1024 x [1024 x i32]], [1024 x [1024 x i32]]* %c, i64 %indvars.iv4, i64 %indvars.iv7, i64 %indvars.iv
  store i32 0, i32* %arrayidx12, align 4, !tbaa !2
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp ne i64 %indvars.iv.next, 1024
  br i1 %exitcond, label %for.body8, label %for.inc13

for.inc13:                                        ; preds = %for.body8
  %indvars.iv.next5 = add nuw nsw i64 %indvars.iv4, 1
  %exitcond6 = icmp ne i64 %indvars.iv.next5, 1024
  br i1 %exitcond6, label %for.body4, label %for.inc16

for.inc16:                                        ; preds = %for.inc13
  %indvars.iv.next8 = add nuw nsw i64 %indvars.iv7, 1
  %exitcond9 = icmp ne i64 %indvars.iv.next8, 1024
  br i1 %exitcond9, label %for.body, label %for.end18

for.end18:                                        ; preds = %for.inc16
  ret void
}

; Function Attrs: argmemonly nounwind willreturn
declare void @llvm.lifetime.start.p0i8(i64 immarg, i8* nocapture) #1

; Function Attrs: argmemonly nounwind willreturn
declare void @llvm.lifetime.end.p0i8(i64 immarg, i8* nocapture) #1

attributes #0 = { nounwind "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="none" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="true" "no-jump-tables"="false" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="ppc64le" "target-features"="+altivec,+bpermd,+crypto,+direct-move,+extdiv,+htm,+power8-vector,+vsx,-power9-vector,-spe" "unsafe-fp-math"="true" "use-soft-float"="false" }
attributes #1 = { argmemonly nounwind willreturn }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"XL C/C++ for Linux on Power, (IBM Internal Development Branch), clang version 12.0.0 (git@github.ibm.com:compiler/llvm-project.git 4285385d877abd46f7a8c81d52e7aeff7a92b6a1)"}
!2 = !{!3, !3, i64 0}
!3 = !{!"int", !4, i64 0}
!4 = !{!"omnipotent char", !5, i64 0}
!5 = !{!"Simple C/C++ TBAA"}

without loop-idiom:

> opt interchange.simp.ll -S -basic-aa -loop-interchange -o out.ll -stats -debug-only=loop-interchange -loop-interchange-threshold=-2 2>&1 | grep -E -- 'legality|loop-interchange'
2 loop-interchange - Number of loops interchanged

with loop-idiom:

> opt interchange.simp.ll -S -basic-aa -loop-idiom -loop-interchange -o out.ll -stats -debug-only=loop-interchange -loop-interchange-threshold=-2 2>&1 | grep -E -- 'legality|loop-interchange'
Not interchanging loops. Cannot prove legality.

I think such an option is fine as a temporary stop-gap solution, but the goal should be to fix the underlying issue IMO.

The long term solution would be to make loop-idiom consider compile-time or PGO loop trip count data when deciding to transform loops. For non-constant non-PGO cases the cost modeling would be a heuristic at best, in which case the option would still be useful.

Otherwise I am worried that the option reduces the incentive to fix the cost-modeling (which should be a fix-able issue).

On the other hand, having an option makes it easier to identify opportunities and collect performance data for developing a cost model and tuning heuristics.

Thanks @bmahjour for uploading a DA test case. I had difficulty coming up with a similar test myself.
To @lebedev.ri : I think the comments from @bmahjour might have answered most of your questions. If not, please list them here and I will try to address them. If you insist in having the testcases in bugreport, I will do that. It might take some time though, since I am still waiting for my password to be reset via the webtools (contacted bugs-admin).

anhtuyen updated this revision to Diff 288602.Aug 28 2020, 7:18 AM

Update with cl::location and remove the extra - from the testcase (eg from --disable... to -disable).

nikic resigned from this revision.Aug 28 2020, 9:08 AM

Hello @lebedev.ri
Bardia and I have provided the answer to your concerns, but we have not heard from you since last week. Please let me know what other changes, if any, that you still need to see.

lebedev.ri resigned from this revision.Aug 31 2020, 7:02 AM
This revision is now accepted and ready to land.Aug 31 2020, 7:02 AM
fhahn added a comment.Sep 1 2020, 8:17 AM

I think such an option is fine as a temporary stop-gap solution, but the goal should be to fix the underlying issue IMO.

The long term solution would be to make loop-idiom consider compile-time or PGO loop trip count data when deciding to transform loops. For non-constant non-PGO cases the cost modeling would be a heuristic at best, in which case the option would still be useful.

Otherwise I am worried that the option reduces the incentive to fix the cost-modeling (which should be a fix-able issue).

On the other hand, having an option makes it easier to identify opportunities and collect performance data for developing a cost model and tuning heuristics.

Sure, I am looking forward to patches improving the cost-modeling.