This is an archive of the discontinued LLVM Phabricator instance.

Paths

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
llvm/test/Transforms/LoopIdiom/disable-options.ll

Differential D86262

[LoopIdiomRecognizePass] Options to disable part or the entire Loop Idiom Recognize Pass
Closed Public

Authored by anhtuyen on Aug 19 2020, 9:29 PM.

Details

Reviewers

bmahjour
nemanjai
dfukalov
rupprecht
serge-sans-paille
etiotto
nikic
lebedev.ri

Commits

rG68717acb24e5: [LoopIdiomRecognizePass] Options to disable part or the entire Loop Idiom…

Summary

Loop Idiom Recognize Pass (LIRP) attempts to transform loops with subscripted arrays into memcpy/memset function calls. In some situations, this transformation has a negative impact.

This patch provides options to disable part or all of the Loop Idiom Recognize Pass. The default behavior stays unchanged: no part of LIRP is disabled by default. The options let users disable a particular part of the transformation while still benefiting from the rest of LIRP.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

anhtuyen created this revision. Aug 19 2020, 9:29 PM

Herald added subscribers: llvm-commits, wuzish, hiraditya. Aug 19 2020, 9:29 PM

anhtuyen requested review of this revision. Aug 19 2020, 9:29 PM

Eugene.Zelenko added inline comments. Aug 19 2020, 9:50 PM

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
239	Unnecessary empty line.

Eugene.Zelenko removed a reviewer: Eugene.Zelenko. Aug 19 2020, 9:51 PM

Harbormaster completed remote builds in B68980: Diff 286707. Aug 19 2020, 10:20 PM

Remove an empty line based on reviewer's comment.

anhtuyen marked an inline comment as done. Aug 19 2020, 10:22 PM

anhtuyen added inline comments.

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
239	Thanks, the blank line has been removed.

Harbormaster completed remote builds in B68982: Diff 286710. Aug 19 2020, 11:07 PM

Why is this the correct course of action?
For example, if the IR already had memcpy/memset, the DA will also be pessimized,
so it really seems like a workaround, not a fix.
Why not just enhance DA?

In D86262#2227676, @lebedev.ri wrote:

Why is this the correct course of action?
For example, if the IR already had memcpy/memset, the DA will also be pessimized,
so it really seems like a workaround, not a fix.
Why not just enhance DA?

I agree with that, it seems to be better to improve DA. Is it feasible?

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
111–114	Please prefer single-line comments in such cases.
239	You can omit .getValue() here and below.

fhahn added a subscriber: fhahn. Aug 20 2020, 1:14 AM

I agree that the justification provided seems somewhat inadequate - using memset/memcpy as a canonical form and teaching DA to handle it seems quite reasonable. However, there are libraries that are highly tuned wrt. when to call these functions and when not to. So it seems to me perfectly reasonable to provide an option to disable this for such uses. For example, a function that performs a memcpy/memset operation on data may be known to be called only from range-limited sites. The compiler (without PGO+LTO, which are often not an option) cannot know that it should expand such calls in the back end.

However, I am not sure I agree with this being the best approach for the implementation. Presumably (now or in the future) there may be other places in this pass that produce memset/memcpy. Also, we will unnecessarily collect stores in collectStores(). So it seems to me like it might be more advantageous to have these options prevent isLegalStore() from returning the corresponding store type.
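
A rough sketch of that suggestion, assuming a simplified store classification. StoreKind, DisableBit and filterStoreKind are hypothetical stand-ins for illustration and are not the actual isLegalStore() interface or the names used in the patch.

```cpp
#include <cstdint>

// Simplified, hypothetical stand-ins for the pass's internal store kinds.
enum class StoreKind { None, Memset, Memcpy };

// Hypothetical bits recording which idioms the user asked to disable.
enum DisableBit : std::uint32_t {
  DisableMemset = 1u << 0,
  DisableMemcpy = 1u << 1,
};

// Pattern is what the store would be classified as with nothing disabled.
// Returning None for a disabled kind means the store is never collected,
// so no later code in the pass gets a chance to transform it.
StoreKind filterStoreKind(StoreKind Pattern, std::uint32_t DisabledMask) {
  if (Pattern == StoreKind::Memset && (DisabledMask & DisableMemset))
    return StoreKind::None;
  if (Pattern == StoreKind::Memcpy && (DisabledMask & DisableMemcpy))
    return StoreKind::None;
  return Pattern;
}
```

Filtering at classification time, rather than bailing out just before the call is emitted, is what keeps the disabled stores out of the collected set in the first place.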

rzurob added a subscriber: rzurob. Aug 20 2020, 4:14 AM

rzurob added inline comments.

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
117	Based on this option, if a user doesn't want loops to be converted to memcpy or memset, they should specify `--disable-loop-idiom=all`. But what if another idiom is added to the LoopIdiomRecognize pass in the future? `--disable-loop-idiom=all` will presumably disable more than what the user asked for. Would it be possible to handle multiple `--disable-loop-idiom=<something>` options cumulatively? If not, does it make sense to just provide one option to disable the memcpy+memset transformation (i.e. both or neither) instead of this option with all/none/memcpy/memset fine-grain control?

In D86262#2228123, @nemanjai wrote:

I agree that the justification provided seems somewhat inadequate - using memset/memcpy as a canonical form and teaching DA to handle it seems quite reasonable. However, there are libraries that are highly tuned wrt. when to call these functions and when not to. So it seems to me perfectly reasonable to provide an option to disable this for such uses. For example, a function that performs a memcpy/memset operation on data may be known to be called only from range-limited sites. The compiler (without PGO+LTO, which are often not an option) cannot know that it should expand such calls in the back end.

The issue you raise seems to be a cost-modeling issue, right? Ideally the compiler would know whether it is faster to use a loop or a memcpy. IIUC you are worried about the cases where the compiler replaces a loop with a memset, but the loop would be faster? Is there a way to improve the cost-modeling? IMO adding options like this potentially leads to papering over cost-modeling issues, rather than addressing the underlying issue. Also, such options tend to not work well together with LTO.

In D86262#2228123, @nemanjai wrote:

Presumably (now or in the future) there may be other places in this pass that produce memset/memcpy. Also, we will unnecessarily collect stores in collectStores(). So it seems to me like it might be more advantageous to have these options prevent isLegalStore() from returning the corresponding store type.

I agree with your observation. I will change the location (if we end up with disabling memset/memcpy).

In D86262#2227781, @dfukalov wrote:

In D86262#2227676, @lebedev.ri wrote:

Why not just enhance DA?

I agree with that, it seems to be better to improve DA. Is it feasible?

I agree that DA can be improved, but I hope that the example below from @nemanjai, about libraries that are highly tuned with respect to when/whether to call memset/memcpy functions, provides another reason for this proposed change.

In D86262#2228420, @anhtuyen wrote:

In D86262#2227781, @dfukalov wrote:

In D86262#2227676, @lebedev.ri wrote:

Why not just enhance DA?

I agree with that, it seems to be better to improve DA. Is it feasible?

I agree that DA can be improved, but I hope that the example below from @nemanjai, about libraries that are highly tuned with respect to when/whether to call memset/memcpy functions, provides another reason for this proposed change.

I'm not really sold on this. We have a similar transform in InstCombine.
If the original problem is with DA, then that is what should be improved.

If this transform is a pessimization for some code, it would be good to actually see that code.

anhtuyen added inline comments. Aug 20 2020, 7:21 AM

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
111–114	Thank you! I will change the style of comments.
117	You raised a good point, thank you @rzurob. Let me re-think about that.
239	Thank you @dfukalov. I will remove .getValue().

I agree with that, it seems to be better to improve DA. Is it feasible?

The theory of data dependence analysis relies on presence of subscripts in array references to be able to produce accurate results. I don't see how we can "improve DA" to address memset/memcpies short of turning them back into loop nests before applying the dependence tests. To do that the loop has to either be materialized before the DA analysis pass is run, or somehow SCEV expressions representing the implied subscripts be synthesized out of thin air. The former must be achieved by a transformation pass, so we would have to turn memset/memcpys into loop nests as soon as possible. For memset/memcpy calls generated by the loop idiom pass, the ideal place for that transformation would be immediately after loop idiom itself, which would have the same effect as preventing loop idiom from creating such loops in the first place when it knows they are not profitable. I don't know of any possible way to do the latter.

I agree with @fhahn that this is more of a cost-modeling issue. I think the cost-modeling would have to rely heavily on loop tripcount data which, in the general case, is only available through PGO, so an option to disable it for users who don't want to use PGO makes sense to me.

Another way that loop idiom can hurt performance is by creating imperfect loop nests. Certain transformations (such as loop interchange and unroll-and-jam) are more difficult (sometimes impossible) to do when loop nests are not perfect.
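
To illustrate the imperfect-nest point, here is a hand-written C++ sketch (not compiler output; the function names, the Matrix alias and the size N are made up for the example). The first function is a perfect nest. The second has the shape one would expect once the zeroing store is replaced by a memset placed ahead of the inner loop, which leaves a statement between the two loop headers.

```cpp
#include <cstddef>
#include <cstring>

constexpr std::size_t N = 4; // Arbitrary size chosen for the illustration.
using Matrix = int[N][N];

// Perfect nest: every statement sits in the innermost loop body.
void perfectNest(Matrix &B, Matrix &C, const Matrix &A) {
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 0; j < N; ++j) {
      B[i][j] = 0;
      C[j][i] = A[j][i] + 1;
    }
}

// Imperfect nest: the zeroing of row i is now a memset between the outer
// and inner loop headers, so the i and j loops are no longer tightly nested.
void imperfectNest(Matrix &B, Matrix &C, const Matrix &A) {
  for (std::size_t i = 0; i < N; ++i) {
    std::memset(B[i], 0, sizeof(B[i]));
    for (std::size_t j = 0; j < N; ++j)
      C[j][i] = A[j][i] + 1;
  }
}
```

Both functions compute the same result; only the second form has work outside the innermost loop, which is the property that makes transformations such as loop interchange harder to apply.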

In D86262#2228451, @lebedev.ri wrote:

In D86262#2228420, @anhtuyen wrote:

In D86262#2227781, @dfukalov wrote:

In D86262#2227676, @lebedev.ri wrote:

Why not just enhance DA?

I agree with that, it seems to be better to improve DA. Is it feasible?

I agree that DA can be improved, but I hope that the example below from @nemanjai, about libraries that are highly tuned with respect to when/whether to call memset/memcpy functions, provides another reason for this proposed change.

I'm not really sold on this. We have a similar transform in InstCombine.
If the original problem is with DA, then that is what should be improved.

If this transform is a pessimization for some code, it would be good to actually see that code.

I hope you find the detailed comment below by @bmahjour (thanks, @bmahjour) a reasonable ground for this change.

The theory of data dependence analysis relies on presence of subscripts in array references to be able to produce accurate results. I don't see how we can "improve DA" to address memset/memcpies short of turning them back into loop nests before applying the dependence tests. To do that the loop has to either be materialized before the DA analysis pass is run, or somehow SCEV expressions representing the implied subscripts be synthesized out of thin air. The former must be achieved by a transformation pass, so we would have to turn memset/memcpys into loop nests as soon as possible. For memset/memcpy calls generated by the loop idiom pass, the ideal place for that transformation would be immediately after loop idiom itself, which would have the same effect as preventing loop idiom from creating such loops in the first place when it knows they are not profitable. I don't know of any possible way to do the latter.

I agree with @fhahn that this is more of a cost-modeling issue. I think the cost-modeling would have to rely heavily on loop tripcount data which, in the general case, is only available through PGO, so an option to disable it for users who don't want to use PGO makes sense to me.

In D86262#2228456, @bmahjour wrote:

I agree with that, it seems to be better to improve DA. Is it feasible?

The theory of data dependence analysis relies on presence of subscripts in array references to be able to produce accurate results. I don't see how we can "improve DA" to address memset/memcpies short of turning them back into loop nests before applying the dependence tests. To do that the loop has to either be materialized before the DA analysis pass is run, or somehow SCEV expressions representing the implied subscripts be synthesized out of thin air. The former must be achieved by a transformation pass, so we would have to turn memset/memcpys into loop nests as soon as possible. For memset/memcpy calls generated by the loop idiom pass, the ideal place for that transformation would be immediately after loop idiom itself, which would have the same effect as preventing loop idiom from creating such loops in the first place when it knows they are not profitable. I don't know of any possible way to do the latter.

IIUC LoopIdiom will effectively remove a loop and replace it with a memset/memcpy. So we should have the same information, just in different forms: a loop that writes successive memory locations, or a single call that we know writes to the same locations. I think it would be good to have a concrete motivating example that highlights what exactly goes wrong.

I agree with @fhahn that this is more of a cost-modeling issue. I think the cost-modeling would have to rely heavily on loop tripcount data which, in the general case, is only available through PGO, so an option to disable it for users who don't want to use PGO makes sense to me.

I don't think the pass creating memset/memcpy that are not profitable is a problem only for 'highly tuned libraries'. It is a problem for any code. I would argue that we should not optimize if we cannot prove that it is likely to be profitable. So if we do not know the trip count (or do not have a good estimate), we should not create memsets/memcpys. Again, a concrete motivating example would be helpful.

... a concrete motivating example would be helpful.

I will create an example that I can share here (without violating the non-disclosure terms on my side).

In D86262#2228156, @fhahn wrote:

In D86262#2228123, @nemanjai wrote:

I agree that the justification provided seems somewhat inadequate - using memset/memcpy as a canonical form and teaching DA to handle it seems quite reasonable. However, there are libraries that are highly tuned wrt. when to call these functions and when not to. So it seems to me perfectly reasonable to provide an option to disable this for such uses. For example, a function that performs a memcpy/memset operation on data may be known to be called only from range-limited sites. The compiler (without PGO+LTO, which are often not an option) cannot know that it should expand such calls in the back end.

The issue you raise seems to be a cost-modeling issue, right? Ideally the compiler would know whether it is faster to use a loop or a memcpy. IIUC you are worried about the cases where the compiler replaces a loop with a memset, but the loop would be faster? Is there a way to improve the cost-modeling? IMO adding options like this potentially leads to papering over cost-modeling issues, rather than addressing the underlying issue. Also, such options tend to not work well together with LTO.

I suppose we can certainly treat this (at least in part) as a cost modeling issue - in that AFAICT, LIR does not do any cost modeling. If it is able to transform a loop into a memcpy, it does. If memcpy is the canonical form, this is OK as a canonicalization pass. However, given that we don't really get to expand these things into loops later on, it is a problem.

For memcmp, we added code (3a7578c6589b910f9a04bae7f7f121dfe3281578) to expand them that ultimately got moved out into a separate pass (063bed9baff63a0d716a5c9533cf2601dafbe0e0). I don't really remember the details of how much handling it has for non-constant lengths but presumably it does (or at least we have an obvious place to add it).
However, I do not think we have a similar pass for memcpy and memset. When we get into the SDAG, we have TargetLowering::findOptimalMemOpLowering() which can only handle constant lengths and presumably cannot expand into loops (being basic block local and all).

When we combine all of the above, code like this ends up with a call to memcpy:

void smallcp(unsigned *__restrict a, unsigned *__restrict b, unsigned len) {
  for (unsigned i = 0; i < (len % 4); i++)
    a[i] = b[i];
}

And my comment regarding "highly tuned libraries" was really meant to suggest a situation where for example, the rem is on the caller side in a different module so even a perfect cost model wouldn't help (well without LTO anyway).
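
For illustration only: since len % 4 is at most 3, the loop in smallcp could be expanded by hand as below instead of becoming a memcpy call. This is a sketch of the desired shape, not something the compiler is claimed to produce. smallcpLoop and smallcpExpanded are hypothetical names, and the __restrict qualifiers are dropped to keep the sketch portable.

```cpp
// Reference loop, mirroring smallcp from the comment above.
void smallcpLoop(unsigned *a, const unsigned *b, unsigned len) {
  for (unsigned i = 0; i < (len % 4); i++)
    a[i] = b[i];
}

// Hand-expanded form: at most three element copies and no library call.
void smallcpExpanded(unsigned *a, const unsigned *b, unsigned len) {
  switch (len % 4) {
  case 3:
    a[2] = b[2];
    // fall through
  case 2:
    a[1] = b[1];
    // fall through
  case 1:
    a[0] = b[0];
    // fall through
  default:
    break;
  }
}
```

The expansion is only possible here because the bound on the trip count is visible in the same function; when the remainder is computed in a caller in another module, that bound is not available without LTO.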

To avoid any issue with NDA, I wrote a simple test as follows. This test program tries to show that replacing stores with memset in the Loop Idiom Recognize Pass (LIRP) does not always have a positive impact.

1. Compile work.c to an IR file called work.ll. We will use this IR file work.ll for both LIRP and LIRP with --disable-loop-idiom=memset.
2. Call opt with -loop-idiom to produce an IR file containing a memset instruction:

opt -basic-aa -loop-idiom -S work.ll -o work.yes.ll

3. Call opt with -loop-idiom --disable-loop-idiom=memset to produce an IR file without a memset instruction:

opt -basic-aa -loop-idiom --disable-loop-idiom=memset -S work.ll -o work.no.ll

4. Inspect both files to make sure LIRP replaced the store with memset in work.yes.ll, but not in work.no.ll.
5. Compile test.c, and link it with the IR from (2) and then with the IR from (3):

clang++ -c test.c
clang++ test.o work.yes.ll -o yes
clang++ test.o work.no.ll -o no

6. Run both executables on a quiet machine. On my performance machine, the times spent (in seconds) are:

With memset: Time elapsed: 1.4215
Without memset: Time elapsed: 1.3611

test.c

#include <stdio.h>
#include <time.h>

int work(int A[], int sizeI, int sizeL);

int main() {
  int A[3] = {1, 2, 3};
  int res = 1;
  clock_t begin = clock();
  res = work(A, 9999, 3);
  clock_t end = clock();
  double time_spent = (double)(end - begin) / CLOCKS_PER_SEC;

  printf("Time elapsed: %4.4f\n", time_spent);
  return res;
}

work.c

int work(int A[], int sizeI, int arraySize) {

    for (int i = 0; i < sizeI; ++i)
      for (int j = 0; j < sizeI; ++j)
        for (int k = 0; k < arraySize; ++k)
          A[k] = 0;

  return A[arraySize - 1];
}

IR before calling opt with -loop-idiom

; ModuleID = './work.ll'
source_filename = "work.c"
target datalayout = "e-m:e-i64:64-n32:64"
target triple = "powerpc64le-unknown-linux-gnu"

; Function Attrs: noinline nounwind
define dso_local signext i32 @_Z4workPiii(i32* %A, i32 signext %sizeI, i32 signext %arraySize)  #0 {
entry:
  %cmp6 = icmp slt i32 0, %sizeI
  br i1 %cmp6, label %for.body.preheader, label %for.end12

for.body.preheader:                               ; preds = %entry
  br label %for.body

for.body:                                         ; preds = %for.body.preheader, %for.inc10
  %i.07 = phi i32 [ %inc11, %for.inc10 ], [ 0, %for.body.preheader ]
  %cmp23 = icmp slt i32 0, %sizeI
  br i1 %cmp23, label %for.body3.preheader, label %for.inc10

for.body3.preheader:                              ; preds = %for.body
  br label %for.body3

for.body3:                                        ; preds = %for.body3.preheader, %for.inc7
  %j.04 = phi i32 [ %inc8, %for.inc7 ], [ 0, %for.body3.preheader ]
  %cmp51 = icmp slt i32 0, %arraySize
  br i1 %cmp51, label %for.body6.preheader, label %for.inc7

for.body6.preheader:                              ; preds = %for.body3
  br label %for.body6

for.body6:                                        ; preds = %for.body6.preheader, %for.body6
  %k.02 = phi i32 [ %inc, %for.body6 ], [ 0, %for.body6.preheader ]
  %idxprom = sext i32 %k.02 to i64
  %arrayidx = getelementptr inbounds i32, i32* %A, i64 %idxprom
  store i32 0, i32* %arrayidx, align 4
  %inc = add nsw i32 %k.02, 1
  %cmp5 = icmp slt i32 %inc, %arraySize
  br i1 %cmp5, label %for.body6, label %for.inc7.loopexit

for.inc7.loopexit:                                ; preds = %for.body6
  br label %for.inc7

for.inc7:                                         ; preds = %for.inc7.loopexit, %for.body3
  %inc8 = add nsw i32 %j.04, 1
  %cmp2 = icmp slt i32 %inc8, %sizeI
  br i1 %cmp2, label %for.body3, label %for.inc10.loopexit

for.inc10.loopexit:                               ; preds = %for.inc7
  br label %for.inc10

for.inc10:                                        ; preds = %for.inc10.loopexit, %for.body
  %inc11 = add nsw i32 %i.07, 1
  %cmp = icmp slt i32 %inc11, %sizeI
  br i1 %cmp, label %for.body, label %for.end12.loopexit

for.end12.loopexit:                               ; preds = %for.inc10
  br label %for.end12

for.end12:                                        ; preds = %for.end12.loopexit, %entry
  %sub = sub nsw i32 %arraySize, 1
  %idxprom13 = sext i32 %sub to i64
  %arrayidx14 = getelementptr inbounds i32, i32* %A, i64 %idxprom13
  %0 = load i32, i32* %arrayidx14, align 4
  ret i32 %0
}

attributes #0 = { noinline nounwind "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="pwr9" "target-features"="+altivec,+bpermd,+crypto,+direct-move,+extdiv,+htm,+power8-vector,+power9-vector,+vsx,-spe" "unsafe-fp-math"="false" "use-soft-float"="false" }

Extract from IR with memset

for.body6.preheader:                              ; preds = %for.body3
  call void @llvm.memset.p0i8.i64(i8* align 4 %A1, i8 0, i64 %1, i1 false)
  br label %for.body6

for.body6:                                        ; preds = %for.body6, %for.body6.preheader
  %k.02 = phi i32 [ %inc, %for.body6 ], [ 0, %for.body6.preheader ]
  %idxprom = sext i32 %k.02 to i64
  %arrayidx = getelementptr inbounds i32, i32* %A, i64 %idxprom
  %inc = add nsw i32 %k.02, 1
  %cmp5 = icmp slt i32 %inc, %arraySize
  br i1 %cmp5, label %for.body6, label %for.inc7.loopexit

Extract from IR without memset

for.body6.preheader:                              ; preds = %for.body3
  br label %for.body6

for.body6:                                        ; preds = %for.body6, %for.body6.preheader
  %k.02 = phi i32 [ %inc, %for.body6 ], [ 0, %for.body6.preheader ]
  %idxprom = sext i32 %k.02 to i64
  %arrayidx = getelementptr inbounds i32, i32* %A, i64 %idxprom
  store i32 0, i32* %arrayidx, align 4
  %inc = add nsw i32 %k.02, 1
  %cmp5 = icmp slt i32 %inc, %arraySize
  br i1 %cmp5, label %for.body6, label %for.inc7.loopexit

I'd like to see a PhaseOrdering test showing the IR that is not optimized due to the DA being unaware about these intrinsics.

This revision now requires changes to proceed. Aug 21 2020, 1:58 PM

In D86262#2231209, @lebedev.ri wrote:

I'd like to see a PhaseOrdering test showing the IR that is not optimized due to the DA being unaware about these intrinsics.

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
117	Based on this option, if a user doesn't want loops to be converted to memcpy or memset, they should specify `--disable-loop-idiom=all`. But what if another idiom is added to the LoopIdiomRecognize pass in the future? `--disable-loop-idiom=all` will presumably disable more than what the user asked for. Would it be possible to handle multiple `--disable-loop-idiom=<something>` options cumulatively? If not, does it make sense to just provide one option to disable the memcpy+memset transformation (i.e. both or neither) instead of this option with all/none/memcpy/memset fine-grain control?

I agree, and I also think it is overkill to provide options to disable only part of the transformation, at least at this point. IMO it is sufficient to start with an all-or-nothing option to disable the entire transformation (of course the transformation should continue to be enabled by default).

Transforming a loop into a memset or memcpy is not *always* profitable (it depends on how many elements are initialized/copied and on the efficiency of the target architecture implementation for those libraries) but it is often better than a loop. The low-level optimizer should change short memset/memcpy calls back into a sequence of assignments. IMO this should not be done in opt, because the exact length at which memset becomes less profitable than individual assignments is a function of the target architecture. As for this PR, adding an option to disable the optimization is a good thing, as it provides more flexibility to users who, for whatever reason, do not want the transformation to run (and is also handy for compiler developers when debugging code). And more flexibility is a good thing.

Here are a couple of programs to test the performance of a simple memset vs a simple initialization loop. On my system (PPC) even this short init. loop is slower than the memset.

% cat loop.c
#include <string.h>
int main() {
  int A[N];
  for (int n=0; n<STEPS; ++n)
    for(int i=0;i<N;++i)
      A[i] = 0;
  return A[0];
}

% gcc -O0 loop.c -DN=10 -DSTEPS=1000000; time ./a.out
./a.out 0.10s user 0.00s system 99% cpu 0.099 total

% cat memset.c
#include <string.h>
int main() {
  int A[N];
  for (int n=0; n<STEPS; ++n)
    memset(A, 0, N * sizeof(int));
  return A[0];
}

% gcc -O0 memset.c -DN=10 -DSTEPS=1000000; time ./a.out
./a.out 0.02s user 0.00s system 99% cpu 0.022 total

Address comments by the reviewers.

Herald added a subscriber: jfb. Aug 24 2020, 8:56 AM

In D86262#2231209, @lebedev.ri wrote:

I'd like to see a PhaseOrdering test showing the IR that is not optimized due to the DA being unaware about these intrinsics.

It is not easy to provide a test case showing that the DA is confused after memset/memcpy function calls are used. For the purpose of this patch, Ettore (thanks @etiotto) and I have shown a few simple testcases where the performance is negatively impacted by the existence of memset (which was inserted by LIRP). I hope you consider it sufficient for the patch to be adopted.

Thanks for sharing the examples, IIUC they illustrate the issue with the cost modeling, which is the main motivation for the patch, rather than DA issues mentioned initially?

In D86262#2233615, @etiotto wrote:

Transforming a loop into a memset or memcpy is not always profitable (it depends on how many elements are initialized/copied and on the efficiency of the target architecture implementation for those libraries). The low-level optimizer should change short memset/memcpy calls back into a sequence of assignments. IMO this should not be done in opt, because the exact length at which memset becomes less profitable than individual assignments is a function of the target architecture. As for this PR, adding an option to disable the optimization is a good thing, as it provides more flexibility to users who, for whatever reason, do not want the transformation to run (and is also handy for compiler developers when debugging code). And more flexibility is a good thing.

The option to disable the pass seems indeed convenient in the short term, but I am not sure if it is really helpful in the long run. It does not address the underlying issue (bad/non-existing cost model) and it means we generate sub-optimal code in some cases for the vast majority of users (who will neither know about nor set this special flag). Also, will the option work with LTO? I think Clang usually does not pass -mllvm options to the linker/LTO plugin.

I think such an option is fine as a temporary stop-gap solution, but the goal should be to fix the underlying issue IMO. Otherwise I am worried that the option reduces the incentive to fix the cost-modeling (which should be a fix-able issue).

It looks like there are some un-addressed comments with respect to the exact way to disable the transformation. If there's a single option to disable the whole pass, it might be worth considering to not add it to the pipeline, rather than bailing out in the pass. Also, it would be good to update the description of the patch with the actual motivation.

anhtuyen added inline comments. Aug 24 2020, 9:08 AM

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
117	Thank you very much Rafik @rzurob and Ettore @etiotto for your opinion. I have switched from cl::opt to cl::bits, which enabled us to handle multiple --disable-loop-idiom=<something> options cumulatively. I hope that this approach addressed your concerns.
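
A minimal sketch of what cumulative handling means, written with a plain bitmask rather than LLVM's cl::bits so that it is self-contained. The enum, its values and parseDisableLoopIdiom are hypothetical and are not the names used in the patch.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical idiom bits; a future idiom would get its own bit.
enum IdiomBit : std::uint32_t {
  IdiomMemset = 1u << 0,
  IdiomMemcpy = 1u << 1,
  IdiomAll = IdiomMemset | IdiomMemcpy,
};

// Accumulate every occurrence of --disable-loop-idiom=<value> into one mask.
// Unknown values and unrelated arguments are ignored in this sketch.
std::uint32_t parseDisableLoopIdiom(const std::vector<std::string> &Args) {
  const std::string Prefix = "--disable-loop-idiom=";
  std::uint32_t Mask = 0;
  for (const std::string &Arg : Args) {
    if (Arg.compare(0, Prefix.size(), Prefix) != 0)
      continue;
    const std::string Value = Arg.substr(Prefix.size());
    if (Value == "memset")
      Mask |= IdiomMemset;
    else if (Value == "memcpy")
      Mask |= IdiomMemcpy;
    else if (Value == "all")
      Mask |= IdiomAll;
  }
  return Mask;
}

// True when the given idiom has been disabled by any of the options seen.
bool isIdiomDisabled(std::uint32_t Mask, IdiomBit Bit) {
  return (Mask & Bit) != 0;
}
```

With this shape, passing the option twice (once with memset, once with memcpy) has the same effect as naming both idioms explicitly, so a user does not need "all" to switch off exactly the two transformations they care about.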

anhtuyen edited the summary of this revision. Aug 24 2020, 9:13 AM

I removed the details about DA being confused by the insertion of memset/memcpy by LIRP in the description.

lebedev.ri requested changes to this revision. Aug 24 2020, 9:31 AM

This revision now requires changes to proceed. Aug 24 2020, 9:31 AM

Harbormaster completed remote builds in B69329: Diff 287413. Aug 24 2020, 9:32 AM

Hi @lebedev.ri
I saw you had again requested changes to this revision, but I could not find which changes you think are required. Can you please be specific?

In D86262#2233995, @anhtuyen wrote:

Hi @lebedev.ri
I saw you had again requested changes to this revision, but I could not find which changes you think are required. Can you please be specific?

I'd like to better understand the roadmap here.
Initially the patch was submitted as a bandaid for DA, but then it was, taken over?, by general performance concerns.

Is there or is there not a DA problem for these intrinsics?
1. Is there a bugreport?
2. Is there a testcase?
3. If the patch proceeds as-is, what's the long-term plan? Is anyone planning on addressing the underlying problem?

For the transform itself:
1. Is there a bugreport?
2. Is there a testcase? (I can see it in comments, but it should be in the bugreport)
3. Is anyone planning on fixing this in a more long-term way, with cost modelling and whatnot?

In none of those cases is such a flag a reasonable long-term solution.

@lebedev.ri

I'd like to see a PhaseOrdering test showing the IR that is not optimized due to the DA being unaware about these intrinsics.

Here's a test that shows a memset created by loop-idiom can prevent loop interchange because the dependence vectors are pessimized:

; ModuleID = 'interchange.ll'
source_filename = "interchange.c"
target datalayout = "e-m:e-i64:64-n32:64"
target triple = "powerpc64le-unknown-linux-gnu"

; Function Attrs: nounwind
define dso_local void @foo([1024 x i32]* %a, [1024 x i32]* %b, [1024 x [1024 x i32]]* %c, i32 signext %n) #0 {
entry:
  br label %for.body

for.body:                                         ; preds = %entry, %for.inc16
  %indvars.iv7 = phi i64 [ 0, %entry ], [ %indvars.iv.next8, %for.inc16 ]
  br label %for.body4

for.body4:                                        ; preds = %for.body, %for.inc13
  %indvars.iv4 = phi i64 [ 0, %for.body ], [ %indvars.iv.next5, %for.inc13 ]
  br label %for.body8

for.body8:                                        ; preds = %for.body4, %for.body8
  %indvars.iv = phi i64 [ 0, %for.body4 ], [ %indvars.iv.next, %for.body8 ]
  %arrayidx12 = getelementptr inbounds [1024 x [1024 x i32]], [1024 x [1024 x i32]]* %c, i64 %indvars.iv4, i64 %indvars.iv7, i64 %indvars.iv
  store i32 0, i32* %arrayidx12, align 4, !tbaa !2
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp ne i64 %indvars.iv.next, 1024
  br i1 %exitcond, label %for.body8, label %for.inc13

for.inc13:                                        ; preds = %for.body8
  %indvars.iv.next5 = add nuw nsw i64 %indvars.iv4, 1
  %exitcond6 = icmp ne i64 %indvars.iv.next5, 1024
  br i1 %exitcond6, label %for.body4, label %for.inc16

for.inc16:                                        ; preds = %for.inc13
  %indvars.iv.next8 = add nuw nsw i64 %indvars.iv7, 1
  %exitcond9 = icmp ne i64 %indvars.iv.next8, 1024
  br i1 %exitcond9, label %for.body, label %for.end18

for.end18:                                        ; preds = %for.inc16
  ret void
}

; Function Attrs: argmemonly nounwind willreturn
declare void @llvm.lifetime.start.p0i8(i64 immarg, i8* nocapture) #1

; Function Attrs: argmemonly nounwind willreturn
declare void @llvm.lifetime.end.p0i8(i64 immarg, i8* nocapture) #1

attributes #0 = { nounwind "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="none" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="true" "no-jump-tables"="false" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="ppc64le" "target-features"="+altivec,+bpermd,+crypto,+direct-move,+extdiv,+htm,+power8-vector,+vsx,-power9-vector,-spe" "unsafe-fp-math"="true" "use-soft-float"="false" }
attributes #1 = { argmemonly nounwind willreturn }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"XL C/C++ for Linux on Power, (IBM Internal Development Branch), clang version 12.0.0 (git@github.ibm.com:compiler/llvm-project.git 4285385d877abd46f7a8c81d52e7aeff7a92b6a1)"}
!2 = !{!3, !3, i64 0}
!3 = !{!"int", !4, i64 0}
!4 = !{!"omnipotent char", !5, i64 0}
!5 = !{!"Simple C/C++ TBAA"}

without loop-idiom:

> opt interchange.simp.ll -S -basic-aa -loop-interchange -o out.ll -stats -debug-only=loop-interchange -loop-interchange-threshold=-2 2>&1 | grep -E -- 'legality|loop-interchange'
2 loop-interchange - Number of loops interchanged

with loop-idiom:

> opt interchange.simp.ll -S -basic-aa -loop-idiom -loop-interchange -o out.ll -stats -debug-only=loop-interchange -loop-interchange-threshold=-2 2>&1 | grep -E -- 'legality|loop-interchange'
Not interchanging loops. Cannot prove legality.
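
A source-level picture of why the two runs differ may help. The sketch below is reconstructed from the IR above under assumptions: the array is shrunk from 1024 to a hypothetical N = 8 so it runs quickly, and the names `Cube`, `zeroWithLoops` and `zeroWithMemset` are illustrative only. The first function mirrors the original nest. The second shows the shape the nest presumably takes once loop-idiom has replaced the innermost loop with a memset call, leaving loop-interchange with a call instead of a subscripted store to analyze.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t N = 8; // hypothetical size; the IR above uses 1024

using Cube = int[N][N][N];

// Shape of the original nest: i is the outermost loop, j the middle loop,
// and the store is c[j][i][k], so i and j are candidates for interchange.
void zeroWithLoops(Cube &c) {
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 0; j < N; ++j)
      for (std::size_t k = 0; k < N; ++k)
        c[j][i][k] = 0;
}

// Assumed shape after loop-idiom: the innermost loop becomes one memset per
// (i, j) pair, so the two remaining loops wrap a call rather than a store.
void zeroWithMemset(Cube &c) {
  for (std::size_t i = 0; i < N; ++i)
    for (std::size_t j = 0; j < N; ++j)
      std::memset(&c[j][i][0], 0, N * sizeof(int));
}
```

Both functions leave the array in the same state; only the form seen by later loop passes differs.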

I think such an option is fine as a temporary stop-gap solution, but the goal should be to fix the underlying issue IMO.

The long term solution would be to make loop-idiom consider compile-time or PGO loop trip count data when deciding to transform loops. For non-constant non-PGO cases the cost modeling would be a heuristic at best, in which case the option would still be useful.

Otherwise I am worried that the option reduces the incentive to fix the cost-modeling (which should be a fix-able issue).

On the other hand, having an option makes it easier to identify opportunities and collect performance data for developing a cost model and tuning heuristics.
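
As a purely illustrative sketch of the trip-count idea above (not code from this patch or from LLVM), a cost check could look roughly like the following. The function name, the parameters and the 64-byte cutoff are all assumptions made for the example; a real threshold would have to come from per-target measurements.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical cutoff below which a plain store loop is assumed to be
// cheaper than a library call. The value 64 is made up for this sketch.
constexpr std::uint64_t kMinProfitableBytes = 64;

// Decide whether forming a memset/memcpy looks worthwhile.
// tripCount is empty when neither a compile-time constant nor profile data
// is available; forceDisable models a command-line override.
bool shouldFormLibCall(std::optional<std::uint64_t> tripCount,
                       std::uint64_t storeSizeBytes, bool forceDisable) {
  if (forceDisable)
    return false;
  if (!tripCount)
    return true; // unknown trip count: keep the existing behaviour
  return *tripCount * storeSizeBytes >= kMinProfitableBytes;
}
```

With an unknown trip count the sketch falls back to transforming, which is why an explicit override would still be useful in that case.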

Thanks @bmahjour for uploading a DA test case. I had difficulty coming up with a similar test myself.
To @lebedev.ri: I think the comments from @bmahjour might have answered most of your questions. If not, please list them here and I will try to address them. If you insist on having the test cases in a bug report, I will do that. It might take some time, though, since I am still waiting for my password to be reset via the webtools (I have contacted bugs-admin).

Bugreport: https://bugs.llvm.org/show_bug.cgi?id=47300

Updated to use cl::location and removed the extra - from the option in the test case (e.g. from --disable... to -disable).

Harbormaster completed remote builds in B69916: Diff 288602.Aug 28 2020, 8:19 AM

nikic resigned from this revision.Aug 28 2020, 9:08 AM

Hello @lebedev.ri
Bardia and I have responded to your concerns, but we have not heard from you since last week. Please let me know what other changes, if any, you still need to see.

lebedev.ri resigned from this revision.Aug 31 2020, 7:02 AM

This revision is now accepted and ready to land.Aug 31 2020, 7:02 AM

Closed by commit rG68717acb24e5: [LoopIdiomRecognizePass] Options to disable part or the entire Loop Idiom… (authored by anhtuyen). Sep 1 2020, 7:00 AM

This revision was automatically updated to reflect the committed changes.

anhtuyen added a commit: rG68717acb24e5: [LoopIdiomRecognizePass] Options to disable part or the entire Loop Idiom….

In D86262#2234176, @bmahjour wrote:

I think such an option is fine as a temporary stop-gap solution, but the goal should be to fix the underlying issue IMO.

The long term solution would be to make loop-idiom consider compile-time or PGO loop trip count data when deciding to transform loops. For non-constant non-PGO cases the cost modeling would be a heuristic at best, in which case the option would still be useful.

Otherwise I am worried that the option reduces the incentive to fix the cost-modeling (which should be a fix-able issue).

On the other hand, having an option makes it easier to identify opportunities and collect performance data for developing a cost model and tuning heuristics.

Sure, I am looking forward to patches improving the cost-modeling.

Revision Contents

Path	Size
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp	31 lines
llvm/test/Transforms/LoopIdiom/disable-options.ll	167 lines

Diff 287413

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp

[101 lines not shown]

using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "loop-idiom"		#define DEBUG_TYPE "loop-idiom"

STATISTIC(NumMemSet, "Number of memset's formed from loop stores");		STATISTIC(NumMemSet, "Number of memset's formed from loop stores");
STATISTIC(NumMemCpy, "Number of memcpy's formed from loop load+stores");		STATISTIC(NumMemCpy, "Number of memcpy's formed from loop load+stores");

		enum class DisableKind {
		All, // Disabled loop idiom recognize pass
		Memcpy, // Proceed, but do not convert loop to memcpy
		Memset // Proceed, but do not convert loop to memset
		};
		dfukalov (inline comment): Please prefer single-line comments in such cases.
		anhtuyen (author): Thank you! I will change the style of comments.

		static cl::bits<DisableKind> DisableLIRP(
		"disable-" DEBUG_TYPE, cl::ReallyHidden,
		rzurob (inline comment): Based on this option, if a user doesn't want loops to be converted to memcpy or memset, they should specify `--disable-loop-idiom=all`. But what if another idiom is added to the LoopIdiomRecognize pass in the future? `--disable-loop-idiom=all` will presumably disable more than what the user asked for. Would it be possible to handle multiple `--disable-loop-idiom=<something>` options cumulatively? If not, does it make sense to just provide one option to disable the memcpy+memset transformation (i.e. both or neither) instead of this option with all/none/memcpy/memset fine-grained control?
		anhtuyen (author): You raised a good point, thank you @rzurob. Let me rethink that.
		etiotto (inline comment):
		> Based on this option, if a user doesn't want loops to be converted to memcpy or memset, they should specify `--disable-loop-idiom=all`. But what if another idiom is added to the LoopIdiomRecognize pass in the future? `--disable-loop-idiom=all` will presumably disable more than what the user asked for. Would it be possible to handle multiple `--disable-loop-idiom=<something>` options cumulatively? If not, does it make sense to just provide one option to disable the memcpy+memset transformation (i.e. both or neither) instead of this option with all/none/memcpy/memset fine-grained control?
		I agree, and I also think it is overkill to provide options to disable only part of the transformation, at least at this point. IMO it is sufficient to start with an all-or-nothing option to disable the entire transformation (of course the transformation should continue to be enabled by default).
		anhtuyen (author): Thank you very much Rafik @rzurob and Ettore @etiotto for your opinions. I have switched from cl::opt to cl::bits, which enables us to handle multiple --disable-loop-idiom=<something> options cumulatively. I hope that this approach addresses your concerns.
		cl::desc("Options to disable Loop Idiom Recognize Pass."),
		cl::values(
		clEnumValN(DisableKind::All, "all",
		"Disable Loop Idiom Recognize Pass."),
		clEnumValN(DisableKind::Memcpy, "memcpy",
		"Proceed with loop idiom recognize pass, but do not convert "
		"loop(s) to memcpy."),
		clEnumValN(DisableKind::Memset, "memset",
		"Proceed with loop idiom recognize pass, but do not convert "
		"loop(s) to memset.")));
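
The cumulative behaviour described in the reply above can be modelled without any LLVM headers. The sketch below is only an analogy for how a bit-set style option accumulates repeated `-disable-loop-idiom=<kind>` occurrences. `DisableSet` and `parseDisableArgs` are hypothetical names introduced for the example, not LLVM APIs.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class DisableKind : unsigned { All = 0, Memcpy = 1, Memset = 2 };

// Minimal stand-in for a bit-set option: one bit per enumerator.
class DisableSet {
  unsigned Bits = 0;

public:
  void set(DisableKind K) { Bits |= 1u << static_cast<unsigned>(K); }
  bool isSet(DisableKind K) const {
    return (Bits & (1u << static_cast<unsigned>(K))) != 0;
  }
};

// Accumulate every "-disable-loop-idiom=<kind>" occurrence. Arguments that
// do not match the prefix, or name an unknown kind, are ignored here.
DisableSet parseDisableArgs(const std::vector<std::string> &Args) {
  const std::string Prefix = "-disable-loop-idiom=";
  DisableSet Result;
  for (const std::string &Arg : Args) {
    if (Arg.compare(0, Prefix.size(), Prefix) != 0)
      continue;
    const std::string Kind = Arg.substr(Prefix.size());
    if (Kind == "all")
      Result.set(DisableKind::All);
    else if (Kind == "memcpy")
      Result.set(DisableKind::Memcpy);
    else if (Kind == "memset")
      Result.set(DisableKind::Memset);
  }
  return Result;
}
```

Passing both `memset` and `memcpy` sets two independent bits, so a future third idiom would stay enabled unless `all` is also given.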

static cl::opt<bool> UseLIRCodeSizeHeurs(		static cl::opt<bool> UseLIRCodeSizeHeurs(
"use-lir-code-size-heurs",		"use-lir-code-size-heurs",
cl::desc("Use loop idiom recognition code size heuristics when compiling"		cl::desc("Use loop idiom recognition code size heuristics when compiling"
"with -Os/-Oz"),		"with -Os/-Oz"),
cl::init(true), cl::Hidden);		cl::init(true), cl::Hidden);

namespace {		namespace {

[94 lines not shown]
public:
static char ID;		static char ID;

explicit LoopIdiomRecognizeLegacyPass() : LoopPass(ID) {		explicit LoopIdiomRecognizeLegacyPass() : LoopPass(ID) {
initializeLoopIdiomRecognizeLegacyPassPass(		initializeLoopIdiomRecognizeLegacyPassPass(
*PassRegistry::getPassRegistry());		*PassRegistry::getPassRegistry());
}		}

bool runOnLoop(Loop *L, LPPassManager &LPM) override {		bool runOnLoop(Loop *L, LPPassManager &LPM) override {
		if (DisableLIRP.isSet(DisableKind::All))
		Eugene.Zelenko (inline comment): Unnecessary empty line.
		anhtuyen (author): Thanks, the blank line has been removed.
		dfukalov (inline comment): You can omit .getValue() here and below.
		anhtuyen (author): Thank you @dfukalov. I will remove .getValue().
		return false;

if (skipLoop(L))		if (skipLoop(L))
return false;		return false;

AliasAnalysis *AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();		AliasAnalysis *AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();
DominatorTree *DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();		DominatorTree *DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();
LoopInfo *LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();		LoopInfo *LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
ScalarEvolution *SE = &getAnalysis<ScalarEvolutionWrapperPass>().getSE();		ScalarEvolution *SE = &getAnalysis<ScalarEvolutionWrapperPass>().getSE();
TargetLibraryInfo *TLI =		TargetLibraryInfo *TLI =
[29 lines not shown]

} // end anonymous namespace		} // end anonymous namespace

char LoopIdiomRecognizeLegacyPass::ID = 0;		char LoopIdiomRecognizeLegacyPass::ID = 0;

PreservedAnalyses LoopIdiomRecognizePass::run(Loop &L, LoopAnalysisManager &AM,		PreservedAnalyses LoopIdiomRecognizePass::run(Loop &L, LoopAnalysisManager &AM,
LoopStandardAnalysisResults &AR,		LoopStandardAnalysisResults &AR,
LPMUpdater &) {		LPMUpdater &) {
		if (DisableLIRP.isSet(DisableKind::All))
		return PreservedAnalyses::all();

const auto *DL = &L.getHeader()->getModule()->getDataLayout();		const auto *DL = &L.getHeader()->getModule()->getDataLayout();

// For the new PM, we also can't use OptimizationRemarkEmitter as an analysis		// For the new PM, we also can't use OptimizationRemarkEmitter as an analysis
// pass. Function analyses need to be preserved across loop transformations		// pass. Function analyses need to be preserved across loop transformations
// but ORE cannot be preserved (see comment before the pass definition).		// but ORE cannot be preserved (see comment before the pass definition).
OptimizationRemarkEmitter ORE(L.getHeader()->getParent());		OptimizationRemarkEmitter ORE(L.getHeader()->getParent());

LoopIdiomRecognize LIR(&AR.AA, &AR.DT, &AR.LI, &AR.SE, &AR.TLI, &AR.TTI,		LoopIdiomRecognize LIR(&AR.AA, &AR.DT, &AR.LI, &AR.SE, &AR.TLI, &AR.TTI,
[216 lines not shown]
LoopIdiomRecognize::isLegalStore(StoreInst *SI) {
Value *SplatValue = isBytewiseValue(StoredVal, *DL);		Value *SplatValue = isBytewiseValue(StoredVal, *DL);
Constant *PatternValue = nullptr;		Constant *PatternValue = nullptr;

// Note: memset and memset_pattern on unordered-atomic is yet not supported		// Note: memset and memset_pattern on unordered-atomic is yet not supported
bool UnorderedAtomic = SI->isUnordered() && !SI->isSimple();		bool UnorderedAtomic = SI->isUnordered() && !SI->isSimple();

// If we're allowed to form a memset, and the stored value would be		// If we're allowed to form a memset, and the stored value would be
// acceptable for memset, use it.		// acceptable for memset, use it.
if (!UnorderedAtomic && HasMemset && SplatValue &&		if (!UnorderedAtomic && HasMemset &&
		!DisableLIRP.isSet(DisableKind::Memset) && SplatValue &&
// Verify that the stored value is loop invariant. If not, we can't		// Verify that the stored value is loop invariant. If not, we can't
// promote the memset.		// promote the memset.
CurLoop->isLoopInvariant(SplatValue)) {		CurLoop->isLoopInvariant(SplatValue)) {
// It looks like we can use SplatValue.		// It looks like we can use SplatValue.
return LegalStoreKind::Memset;		return LegalStoreKind::Memset;
} else if (!UnorderedAtomic && HasMemsetPattern &&		} else if (!UnorderedAtomic && HasMemsetPattern &&
		!DisableLIRP.isSet(DisableKind::Memset) &&
// Don't create memset_pattern16s with address spaces.		// Don't create memset_pattern16s with address spaces.
StorePtr->getType()->getPointerAddressSpace() == 0 &&		StorePtr->getType()->getPointerAddressSpace() == 0 &&
(PatternValue = getMemSetPatternValue(StoredVal, DL))) {		(PatternValue = getMemSetPatternValue(StoredVal, DL))) {
// It looks like we can use PatternValue!		// It looks like we can use PatternValue!
return LegalStoreKind::MemsetPattern;		return LegalStoreKind::MemsetPattern;
}		}

// Otherwise, see if the store can be turned into a memcpy.		// Otherwise, see if the store can be turned into a memcpy.
if (HasMemcpy) {		if (HasMemcpy && !DisableLIRP.isSet(DisableKind::Memcpy)) {
// Check to see if the stride matches the size of the store. If so, then we		// Check to see if the stride matches the size of the store. If so, then we
// know that every byte is touched in the loop.		// know that every byte is touched in the loop.
APInt Stride = getStoreStride(StoreEv);		APInt Stride = getStoreStride(StoreEv);
unsigned StoreSize = DL->getTypeStoreSize(SI->getValueOperand()->getType());		unsigned StoreSize = DL->getTypeStoreSize(SI->getValueOperand()->getType());
if (StoreSize != Stride && StoreSize != -Stride)		if (StoreSize != Stride && StoreSize != -Stride)
return LegalStoreKind::None;		return LegalStoreKind::None;

// The store must be feeding a non-volatile load.		// The store must be feeding a non-volatile load.
[1,383 lines not shown]
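
The guards added to isLegalStore in the hunk above can be summarised with a small stand-alone model. It is only a sketch: `DisableFlags`, `StoreInfo` and `classify` are hypothetical names, the memset_pattern path is folded away, and the real legality checks (stride, atomicity, library availability) are omitted.

```cpp
#include <cassert>

enum class LegalStoreKind { None, Memset, Memcpy };

// Hypothetical stand-in for the DisableLIRP option in the patch.
struct DisableFlags {
  bool All = false;
  bool Memcpy = false;
  bool Memset = false;
};

// Simplified description of a store inside the loop.
struct StoreInfo {
  bool IsLoopInvariantSplat; // stored value could be expressed as a memset
  bool FedByLoad;            // store copies a value loaded in the same loop
};

LegalStoreKind classify(const StoreInfo &S, const DisableFlags &D) {
  if (D.All)
    return LegalStoreKind::None; // the pass returns before examining stores
  if (S.IsLoopInvariantSplat && !D.Memset)
    return LegalStoreKind::Memset;
  if (S.FedByLoad && !D.Memcpy)
    return LegalStoreKind::Memcpy;
  return LegalStoreKind::None;
}
```

In this model, disabling both memset and memcpy leaves every store unclassified, which matches the test below reusing the DIS-ALL prefix for that combination.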

llvm/test/Transforms/LoopIdiom/disable-options.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -basic-aa -loop-idiom < %s -S | FileCheck %s --check-prefix=DIS-NONE
				; RUN: opt -basic-aa -loop-idiom --disable-loop-idiom=all < %s -S | FileCheck %s --check-prefix=DIS-ALL
				; RUN: opt -basic-aa -loop-idiom --disable-loop-idiom=memcpy < %s -S | FileCheck %s --check-prefix=DIS-MEMCPY
				; RUN: opt -basic-aa -loop-idiom --disable-loop-idiom=memset < %s -S | FileCheck %s --check-prefix=DIS-MEMSET
				; RUN: opt -basic-aa -loop-idiom --disable-loop-idiom=memset --disable-loop-idiom=memcpy < %s -S | FileCheck %s --check-prefix=DIS-ALL
				; RUN: opt -passes="loop-idiom" -aa-pipeline=basic-aa < %s -S | FileCheck %s --check-prefix=DIS-NONE
				; RUN: opt -passes="loop-idiom" -aa-pipeline=basic-aa --disable-loop-idiom=all < %s -S | FileCheck %s --check-prefix=DIS-ALL
				; RUN: opt -passes="loop-idiom" -aa-pipeline=basic-aa --disable-loop-idiom=memcpy < %s -S | FileCheck %s --check-prefix=DIS-MEMCPY
				; RUN: opt -passes="loop-idiom" -aa-pipeline=basic-aa --disable-loop-idiom=memset < %s -S | FileCheck %s --check-prefix=DIS-MEMSET
				; RUN: opt -passes="loop-idiom" -aa-pipeline=basic-aa --disable-loop-idiom=memset --disable-loop-idiom=memcpy < %s -S | FileCheck %s --check-prefix=DIS-ALL

				define void @test-memcpy(i64 %Size) nounwind ssp {
				; DIS-NONE-LABEL: @test-memcpy(
				; DIS-NONE-NEXT: bb.nph:
				; DIS-NONE-NEXT: [[BASE:%.*]] = alloca i8, i32 10000, align 1
				; DIS-NONE-NEXT: [[DEST:%.*]] = alloca i8, i32 10000, align 1
				; DIS-NONE-NEXT: call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 1 [[DEST]], i8* align 1 [[BASE]], i64 [[SIZE:%.*]], i1 false)
				; DIS-NONE-NEXT: br label [[FOR_BODY:%.*]]
				; DIS-NONE: for.body:
				; DIS-NONE-NEXT: [[INDVAR:%.*]] = phi i64 [ 0, [[BB_NPH:%.*]] ], [ [[INDVAR_NEXT:%.*]], [[FOR_BODY]] ]
				; DIS-NONE-NEXT: [[I_0_014:%.*]] = getelementptr i8, i8* [[BASE]], i64 [[INDVAR]]
				; DIS-NONE-NEXT: [[DESTI:%.*]] = getelementptr i8, i8* [[DEST]], i64 [[INDVAR]]
				; DIS-NONE-NEXT: [[V:%.*]] = load i8, i8* [[I_0_014]], align 1
				; DIS-NONE-NEXT: [[INDVAR_NEXT]] = add i64 [[INDVAR]], 1
				; DIS-NONE-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVAR_NEXT]], [[SIZE]]
				; DIS-NONE-NEXT: br i1 [[EXITCOND]], label [[FOR_END:%.*]], label [[FOR_BODY]]
				; DIS-NONE: for.end:
				; DIS-NONE-NEXT: ret void
				;
				; DIS-ALL-LABEL: @test-memcpy(
				; DIS-ALL-NEXT: bb.nph:
				; DIS-ALL-NEXT: [[BASE:%.*]] = alloca i8, i32 10000, align 1
				; DIS-ALL-NEXT: [[DEST:%.*]] = alloca i8, i32 10000, align 1
				; DIS-ALL-NEXT: br label [[FOR_BODY:%.*]]
				; DIS-ALL: for.body:
				; DIS-ALL-NEXT: [[INDVAR:%.*]] = phi i64 [ 0, [[BB_NPH:%.*]] ], [ [[INDVAR_NEXT:%.*]], [[FOR_BODY]] ]
				; DIS-ALL-NEXT: [[I_0_014:%.*]] = getelementptr i8, i8* [[BASE]], i64 [[INDVAR]]
				; DIS-ALL-NEXT: [[DESTI:%.*]] = getelementptr i8, i8* [[DEST]], i64 [[INDVAR]]
				; DIS-ALL-NEXT: [[V:%.*]] = load i8, i8* [[I_0_014]], align 1
				; DIS-ALL-NEXT: store i8 [[V]], i8* [[DESTI]], align 1
				; DIS-ALL-NEXT: [[INDVAR_NEXT]] = add i64 [[INDVAR]], 1
				; DIS-ALL-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVAR_NEXT]], [[SIZE:%.*]]
				; DIS-ALL-NEXT: br i1 [[EXITCOND]], label [[FOR_END:%.*]], label [[FOR_BODY]]
				; DIS-ALL: for.end:
				; DIS-ALL-NEXT: ret void
				;
				; DIS-MEMCPY-LABEL: @test-memcpy(
				; DIS-MEMCPY-NEXT: bb.nph:
				; DIS-MEMCPY-NEXT: [[BASE:%.*]] = alloca i8, i32 10000, align 1
				; DIS-MEMCPY-NEXT: [[DEST:%.*]] = alloca i8, i32 10000, align 1
				; DIS-MEMCPY-NEXT: br label [[FOR_BODY:%.*]]
				; DIS-MEMCPY: for.body:
				; DIS-MEMCPY-NEXT: [[INDVAR:%.*]] = phi i64 [ 0, [[BB_NPH:%.*]] ], [ [[INDVAR_NEXT:%.*]], [[FOR_BODY]] ]
				; DIS-MEMCPY-NEXT: [[I_0_014:%.*]] = getelementptr i8, i8* [[BASE]], i64 [[INDVAR]]
				; DIS-MEMCPY-NEXT: [[DESTI:%.*]] = getelementptr i8, i8* [[DEST]], i64 [[INDVAR]]
				; DIS-MEMCPY-NEXT: [[V:%.*]] = load i8, i8* [[I_0_014]], align 1
				; DIS-MEMCPY-NEXT: store i8 [[V]], i8* [[DESTI]], align 1
				; DIS-MEMCPY-NEXT: [[INDVAR_NEXT]] = add i64 [[INDVAR]], 1
				; DIS-MEMCPY-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVAR_NEXT]], [[SIZE:%.*]]
				; DIS-MEMCPY-NEXT: br i1 [[EXITCOND]], label [[FOR_END:%.*]], label [[FOR_BODY]]
				; DIS-MEMCPY: for.end:
				; DIS-MEMCPY-NEXT: ret void
				;
				; DIS-MEMSET-LABEL: @test-memcpy(
				; DIS-MEMSET-NEXT: bb.nph:
				; DIS-MEMSET-NEXT: [[BASE:%.*]] = alloca i8, i32 10000, align 1
				; DIS-MEMSET-NEXT: [[DEST:%.*]] = alloca i8, i32 10000, align 1
				; DIS-MEMSET-NEXT: call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 1 [[DEST]], i8* align 1 [[BASE]], i64 [[SIZE:%.*]], i1 false)
				; DIS-MEMSET-NEXT: br label [[FOR_BODY:%.*]]
				; DIS-MEMSET: for.body:
				; DIS-MEMSET-NEXT: [[INDVAR:%.*]] = phi i64 [ 0, [[BB_NPH:%.*]] ], [ [[INDVAR_NEXT:%.*]], [[FOR_BODY]] ]
				; DIS-MEMSET-NEXT: [[I_0_014:%.*]] = getelementptr i8, i8* [[BASE]], i64 [[INDVAR]]
				; DIS-MEMSET-NEXT: [[DESTI:%.*]] = getelementptr i8, i8* [[DEST]], i64 [[INDVAR]]
				; DIS-MEMSET-NEXT: [[V:%.*]] = load i8, i8* [[I_0_014]], align 1
				; DIS-MEMSET-NEXT: [[INDVAR_NEXT]] = add i64 [[INDVAR]], 1
				; DIS-MEMSET-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVAR_NEXT]], [[SIZE]]
				; DIS-MEMSET-NEXT: br i1 [[EXITCOND]], label [[FOR_END:%.*]], label [[FOR_BODY]]
				; DIS-MEMSET: for.end:
				; DIS-MEMSET-NEXT: ret void
				;
				bb.nph:
				%Base = alloca i8, i32 10000
				%Dest = alloca i8, i32 10000
				br label %for.body

				for.body: ; preds = %bb.nph, %for.body
				%indvar = phi i64 [ 0, %bb.nph ], [ %indvar.next, %for.body ]
				%I.0.014 = getelementptr i8, i8* %Base, i64 %indvar
				%DestI = getelementptr i8, i8* %Dest, i64 %indvar
				%V = load i8, i8* %I.0.014, align 1
				store i8 %V, i8* %DestI, align 1
				%indvar.next = add i64 %indvar, 1
				%exitcond = icmp eq i64 %indvar.next, %Size
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body, %entry
				ret void
				}

				define void @test-memset(i8* %Base, i64 %Size) nounwind ssp {
				; CHECK-LABEL: @test-memset(
				; CHECK-NEXT: bb.nph:
				; CHECK-NEXT: call void @llvm.memset.p0i8.i64(i8* align 1 [[BASE:%.*]], i8 0, i64 [[SIZE:%.*]], i1 false)
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[INDVAR:%.*]] = phi i64 [ 0, [[BB_NPH:%.*]] ], [ [[INDVAR_NEXT:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[I_0_014:%.*]] = getelementptr i8, i8* [[BASE]], i64 [[INDVAR]]
				; CHECK-NEXT: [[INDVAR_NEXT]] = add i64 [[INDVAR]], 1
				; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVAR_NEXT]], [[SIZE]]
				; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_END:%.*]], label [[FOR_BODY]]
				; CHECK: for.end:
				; CHECK-NEXT: ret void
				;
				; DIS-ALL-LABEL: @test-memset(
				; DIS-ALL-NEXT: bb.nph:
				; DIS-ALL-NEXT: br label [[FOR_BODY:%.*]]
				; DIS-ALL: for.body:
				; DIS-ALL-NEXT: [[INDVAR:%.*]] = phi i64 [ 0, [[BB_NPH:%.*]] ], [ [[INDVAR_NEXT:%.*]], [[FOR_BODY]] ]
				; DIS-ALL-NEXT: [[I_0_014:%.*]] = getelementptr i8, i8* [[BASE:%.*]], i64 [[INDVAR]]
				; DIS-ALL-NEXT: store i8 0, i8* [[I_0_014]], align 1
				; DIS-ALL-NEXT: [[INDVAR_NEXT]] = add i64 [[INDVAR]], 1
				; DIS-ALL-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVAR_NEXT]], [[SIZE:%.*]]
				; DIS-ALL-NEXT: br i1 [[EXITCOND]], label [[FOR_END:%.*]], label [[FOR_BODY]]
				; DIS-ALL: for.end:
				; DIS-ALL-NEXT: ret void
				;
				; DIS-MEMCPY-LABEL: @test-memset(
				; DIS-MEMCPY-NEXT: bb.nph:
				; DIS-MEMCPY-NEXT: call void @llvm.memset.p0i8.i64(i8* align 1 [[BASE:%.*]], i8 0, i64 [[SIZE:%.*]], i1 false)
				; DIS-MEMCPY-NEXT: br label [[FOR_BODY:%.*]]
				; DIS-MEMCPY: for.body:
				; DIS-MEMCPY-NEXT: [[INDVAR:%.*]] = phi i64 [ 0, [[BB_NPH:%.*]] ], [ [[INDVAR_NEXT:%.*]], [[FOR_BODY]] ]
				; DIS-MEMCPY-NEXT: [[I_0_014:%.*]] = getelementptr i8, i8* [[BASE]], i64 [[INDVAR]]
				; DIS-MEMCPY-NEXT: [[INDVAR_NEXT]] = add i64 [[INDVAR]], 1
				; DIS-MEMCPY-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVAR_NEXT]], [[SIZE]]
				; DIS-MEMCPY-NEXT: br i1 [[EXITCOND]], label [[FOR_END:%.*]], label [[FOR_BODY]]
				; DIS-MEMCPY: for.end:
				; DIS-MEMCPY-NEXT: ret void
				;
				; DIS-MEMSET-LABEL: @test-memset(
				; DIS-MEMSET-NEXT: bb.nph:
				; DIS-MEMSET-NEXT: br label [[FOR_BODY:%.*]]
				; DIS-MEMSET: for.body:
				; DIS-MEMSET-NEXT: [[INDVAR:%.*]] = phi i64 [ 0, [[BB_NPH:%.*]] ], [ [[INDVAR_NEXT:%.*]], [[FOR_BODY]] ]
				; DIS-MEMSET-NEXT: [[I_0_014:%.*]] = getelementptr i8, i8* [[BASE:%.*]], i64 [[INDVAR]]
				; DIS-MEMSET-NEXT: store i8 0, i8* [[I_0_014]], align 1
				; DIS-MEMSET-NEXT: [[INDVAR_NEXT]] = add i64 [[INDVAR]], 1
				; DIS-MEMSET-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVAR_NEXT]], [[SIZE:%.*]]
				; DIS-MEMSET-NEXT: br i1 [[EXITCOND]], label [[FOR_END:%.*]], label [[FOR_BODY]]
				; DIS-MEMSET: for.end:
				; DIS-MEMSET-NEXT: ret void
				;
				bb.nph: ; preds = %entry
				br label %for.body

				for.body: ; preds = %bb.nph, %for.body
				%indvar = phi i64 [ 0, %bb.nph ], [ %indvar.next, %for.body ]
				%I.0.014 = getelementptr i8, i8* %Base, i64 %indvar
				store i8 0, i8* %I.0.014, align 1
				%indvar.next = add i64 %indvar, 1
				%exitcond = icmp eq i64 %indvar.next, %Size
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body, %entry
				ret void
				}