This is an archive of the discontinued LLVM Phabricator instance.

Add loop-versioning pass to improve unit-stride
ClosedPublic

Authored by Leporacanthicus on Jan 9 2023, 10:25 AM.

Details

Summary

Introduce conditional code that identifies when the stride is "one element", and simplifies the array accesses for that case.

This allows better loop performance in various benchmarks.
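
For illustration, here is a rough Fortran-level sketch of the idea (an assumed example, not the pass's actual output - the pass performs the equivalent rewrite on FIR): the loop is versioned on a runtime check that the array has unit stride, and the fast version gives later optimizations (for example the LLVM loop vectorizer) simple unit-stride accesses to work with.

! Hand-written sketch only; the names and the loop body are made up for illustration.
subroutine scale(a, n)
  real :: a(:)                ! assumed-shape: stride only known at run time
  integer :: n, i
  if (is_contiguous(a)) then
    call scale_fast(a, n)     ! stride is one element: simple, vectorizable accesses
  else
    do i = 1, n               ! fallback: original, descriptor-based strided accesses
      a(i) = 2.0 * a(i)
    end do
  end if
contains
  subroutine scale_fast(b, m)
    real, contiguous :: b(:)  ! the compiler may now assume unit stride
    integer :: m, i
    do i = 1, m
      b(i) = 2.0 * b(i)
    end do
  end subroutine scale_fast
end subroutine scale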

Diff Detail

Event Timeline

Herald added projects: Restricted Project, Restricted Project. · View Herald Transcript
Herald added a subscriber: mehdi_amini. · View Herald Transcript
Leporacanthicus requested review of this revision.Jan 9 2023, 10:25 AM

Hi @Leporacanthicus, thank you for working on this! I wonder if you target any particular benchmark with these changes - can you please post any information in the description?

While I was comparing different compilers on SPEC CPU2000, 2006 and 2017 and Polyhedron, I believe I only saw one case where the missing "contiguity" information affected a hotspot - this was in CPU2000/187.facerec, and the estimated improvement was ~14%. This was in graphRoutines.f90::LocalMove, where the 2-level loop nest for the multiplication+reduction could be collapsed and vectorized with unit stride:

 274         Subroutine LocalMove (Params, Position, Graph, GaborTrafo, &
 275                                           Similarity, Hops, Sweeps)
...
 289         Real(4),        Intent(In)    :: Graph (:, :, :)
 290         Real(4),        Intent(In)    :: GaborTrafo (:, :, :, :)
...
 392                                                         NewJetSim = SUM (Graph (:, :, IG) * &
 393         &                                                GaborTrafo (:, :, NewX, NewY))

I wonder if you found more cases where this optimization may have significant impact.

I think you want to replace "assumed size array" with "assumed shape array" in the comments, since assumed-size arrays are contiguous by definition.
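
For reference, a minimal example of the two kinds of declarations being contrasted (my own example, not code from the patch):

subroutine s(a, b, n)
  integer :: n, i
  real :: a(*)   ! assumed-size: contiguous by definition, unit stride
  real :: b(:)   ! assumed-shape: descriptor-based, may be strided (e.g. when the caller passes x(1:2*n:2))
  do i = 1, n
    a(i) = b(i)
  end do
end subroutine s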

Hi @Leporacanthicus, thank you for working on this! I wonder if you target any particular benchmark with these changes - can you please post any information in the description?

This is work described here:
https://discourse.llvm.org/t/transformations-to-aid-optimizer-for-subroutines-functions-with-assumed-shape-arguments/66447/3

As mentioned, this (we believe) will benefit the Spec2017FP roms_r benchmark - probably others too.

At the moment, it's not really working even for simple cases, so that's a bit of an issue, but hopefully we can get there... :)

Hi @Leporacanthicus, thank you for working on this! I wonder if you target any particular benchmark with these changes - can you please post any information in the description?

This is work described here:
https://discourse.llvm.org/t/transformations-to-aid-optimizer-for-subroutines-functions-with-assumed-shape-arguments/66447/3

Thank you for the link!

Added "move remaining loop into else branch" and related changes.

Still doesn't actually work as intended.

Rebase to latest llvm/main.

Some progress, now identifying the correct loops a little better.

Support for 2D arrays, but not "any dimensions".

This patch does not work correctly; it needs further work.

Updated patch, now works for 1D and 2D arrays.

peixin added a subscriber: peixin.Feb 3 2023, 6:21 AM
peixin added inline comments.
flang/include/flang/Optimizer/Transforms/Passes.td
264

Update to remove WIP

Rebased to latest llvm/main

Still restricted to 2D arrays

Add tests.

Fix bug preventing 1D array being handled.

Leporacanthicus retitled this revision from WIP LoopVersioning to Add loop-versioning pass to improve unit-stride.Mar 3 2023, 2:42 AM
Leporacanthicus edited the summary of this revision. (Show Details)
tblah added a subscriber: tblah.Mar 3 2023, 6:38 AM
tblah added a comment.Mar 6 2023, 8:04 AM

Thanks for implementing this.

Why are only function arguments considered? What about unknown sized arrays from other sources?

flang/include/flang/Optimizer/Transforms/Passes.td
264
flang/include/flang/Tools/CLOptions.inc
186

nit: I think this should be wrapped in LLVM_DEBUG() and be written to llvm::dbgs()

flang/lib/Optimizer/Transforms/LoopVersioning.cpp
72

See hlfir::getFortranElementOrSequenceType in HLFIRDialect.h. If you want to explicitly not support hlfir::Expr, you can still unwrap the box using fir::unwrapPassByRefType(fir::unwrapRefType(type)).

I think mlir::Type values can be null, so the std::optional isn't needed

111

This should be inside of LLVM_DEBUG

126

This can be written as func.walk([&](fir::DoLoopOp loop) {. Then the if statement isn't required.

143

I think we can break here because if this argument of interest was used as operand 0 to the coordinate op, later arguments cannot be the same argument.

195–198

I don't mind whether you decide to change this or not, but there's a convenience interface for this: see FirOpBuilder::IfBuilder

200

nit: LLVM_DEBUG

251

LLVM_DEBUG

Leporacanthicus marked 6 inline comments as done.

Updated to reflect review comments.

  • Use different type for Walk functions.
  • Use "unwrap" functions to get to types.
  • Fix some debug output and related stuff.
  • Some other minor changes, including clang-format fixups.
flang/include/flang/Tools/CLOptions.inc
186

Removing, it was left behind when I wasn't sure if the pass was being run or not. None of the other passes print when they are created.

flang/lib/Optimizer/Transforms/LoopVersioning.cpp
72

Removed the optional.

I've simplified the code using the unwrap functions. I'm not sure there is any benefit in bringing HLFIR into this - we shouldn't have HLFIR types here, should we? It also looks like the HLFIR code creates a NEW type, which seems a bit unnecessary.

126

Also useful in a few other places! :)

251

I changed it to an assert - it ought to not happen! ;) We probably want to know if it DOES happen!

tblah added inline comments.Mar 9 2023, 4:03 AM
flang/test/Transforms/loop-versioing.fir
53 ↗(On Diff #503460)

What if there is a non-default lower bound in #[[DIMS]]#0? Won't removing the box cause the lower bound to be ignored?

peixin added a comment.EditedMar 14 2023, 5:47 AM

I am wondering whether we should do the loop versioning in MLIR, since this may increase compilation time in the optimization passes.

The loop vectorize pass can support this kind of versioning in some cases. Check this case: https://github.com/llvm/llvm-project/blob/main/llvm/test/Transforms/LoopVectorize/version-mem-access.ll. It uses SCEV analysis combined with a runtime check (https://github.com/llvm/llvm-project/blob/c11c2f5f6548a5303517c89ba6629bbfa7fae0d9/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L1943-L1973). The original support came from a C example similar to Fortran code: https://github.com/llvm/llvm-project/commit/c2e9d759f29602021daa26453452928c81adffbb.

I am not against this method for now, since supporting this in the loop vectorization pass is not easy work.

flang/include/flang/Optimizer/Transforms/Passes.td
265

I am curious what other optimizations would benefit from this?

I am wondering whether we should do the loop versioning in MLIR, since this may increase compilation time in the optimization passes.

I have not noticed any measurable increase in compile time - the compile time for Spec 2017 wrf_r stays within 1-2 seconds of what it was, and that's the one that takes about 15 minutes to build in total. I'm not saying it's impossible to come up with something that compiles slowly with this code, but I've made as good an attempt as I can to "exit early" if there's nothing that needs doing, and to only duplicate the innermost loop [which potentially is not the most optimal case].

The loop vectorize pass can support this kind of versioning in some cases. Check this case: https://github.com/llvm/llvm-project/blob/main/llvm/test/Transforms/LoopVectorize/version-mem-access.ll. It uses SCEV analysis combined with a runtime check (https://github.com/llvm/llvm-project/blob/c11c2f5f6548a5303517c89ba6629bbfa7fae0d9/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L1943-L1973). The original support came from a C example similar to Fortran code: https://github.com/llvm/llvm-project/commit/c2e9d759f29602021daa26453452928c81adffbb.

I am not against this method for now, since supporting this in the loop vectorization pass is not easy work.

The problem I'm trying to solve here is that the Loop Vectorizer optimisation doesn't "like" the descriptor accesses used to read the strides [and other descriptor accesses] - it doesn't understand that the value is a constant for a given run of a loop. With versioning, it basically makes the Loop Vectorizer "happy" with the loop, so it can do its own thing.

flang/include/flang/Optimizer/Transforms/Passes.td
265

Just generally "making loops simpler" (fewer indirect accesses), which allows the compiler to do a better job - the "other loop optimisations" does not refer to specific passes or optimisation steps as such, just "improvement in the overall code being produced".

flang/test/Transforms/loop-versioing.fir
53 ↗(On Diff #503460)

I'm not entirely sure how that would end up being the case for the type of code we're processing here. I tried writing some code that used an array b(12:21) instead of b(10), and it goes wrong if I use do i = 12, 21 in the loop in the called function: it assumes that b(:) is an array starting at 1, not 12, even though the original array is 12:21.
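
A minimal example of the lower-bound behaviour in question (a sketch, not the exact code from my test): an assumed-shape dummy argument gets a default lower bound of 1, whatever the bounds of the actual argument are.

program lower_bound_demo
  interface
    subroutine show(b)
      real :: b(:)
    end subroutine show
  end interface
  real :: b(12:21)
  b = 0.0
  call show(b)
end program lower_bound_demo

subroutine show(b)
  real :: b(:)
  print *, lbound(b, 1), ubound(b, 1)   ! prints 1 and 10, not 12 and 21
  ! so a loop written as "do i = 12, 21" in here would index past the end of b
end subroutine show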

I have not noticed any measurable increase in compile time - the compile time for Spec 2017 wrf_r stays within 1-2 seconds of what it was, and that's the one that takes about 15 minutes to build in total. I'm not saying it's impossible to come up with something that compiles slowly with this code, but I've made as good an attempt as I can to "exit early" if there's nothing that needs doing, and to only duplicate the innermost loop [which potentially is not the most optimal case].

Great! This sounds good to me.

The problem I'm trying to solve here is that the Loop Vectorizer optimisation doesn't "like" the descriptor accesses used to read the strides [and other descriptor accesses] - it doesn't understand that the value is a constant for a given run of a loop. With versioning, it basically makes the Loop Vectorizer "happy" with the loop, so it can do its own thing.

I think the loop vectorize pass is doing a similar thing. Check the LLVM IR generated for the following Fortran case:

subroutine sum1d(a, n)
  real*8 :: a(:)
  integer :: n
  real*8 :: sum
  integer :: i
  sum = 0
  do i=1,n
     sum = sum + a(i)
  end do
  call temp(sum)
end subroutine sum1d
$ flang-new -fc1 -emit-llvm -O3 test.f90
$ cat test.ll
define void @sum1d_(ptr nocapture readonly %0, ptr nocapture readonly %1) local_unnamed_addr {
  %3 = alloca double, align 8
  store double 0.000000e+00, ptr %3, align 8, !tbaa !1
  %4 = load i32, ptr %1, align 4, !tbaa !1
  %5 = icmp sgt i32 %4, 0
  br i1 %5, label %.lr.ph, label %._crit_edge

.lr.ph:                                           ; preds = %2
  %6 = zext i32 %4 to i64
  %7 = load ptr, ptr %0, align 8, !tbaa !5
  %8 = getelementptr { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, ptr %0, i64 0, i32 7, i64 0, i64 2
  %9 = load i64, ptr %8, align 8, !tbaa !5
  br label %10

10:                                               ; preds = %.lr.ph, %10
  %indvars.iv = phi i64 [ 1, %.lr.ph ], [ %indvars.iv.next, %10 ]
  %11 = phi double [ 0.000000e+00, %.lr.ph ], [ %16, %10 ]
  %12 = add nsw i64 %indvars.iv, -1
  %13 = mul i64 %9, %12
  %14 = getelementptr i8, ptr %7, i64 %13                         ; Note: This has a similar structure to the LLVM IR generated from the following C code.
  %15 = load double, ptr %14, align 8, !tbaa !1
  %16 = fadd contract double %11, %15
  store double %16, ptr %3, align 8, !tbaa !1
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv, %6
  br i1 %exitcond.not, label %._crit_edge, label %10

._crit_edge:                                      ; preds = %10, %2
  call void @temp_(ptr nonnull %3)
  ret void
}

When you compare it with the generated IR from the following C case using clang case.c -S -emit-llvm -O3:

void temp(double *sum);

void test1(double * restrict A, double * restrict B, int N, int Stride) {
  int i;
  for (i = 0; i < N; ++i)
    B[i * Stride] += A[i * Stride];
}

void test2(double * restrict A, int N, int Stride) {
  int i;
  double sum = 0.0;
  for (i = 0; i < N; ++i)
    sum += A[i * Stride];
  temp(&sum);
}

The loop in function test1 can be vectorized using stride-check loop versioning. I didn't dig into why test2 fails to vectorize the reduction. Anyway, the loop vectorize pass can do some similar work. But I am not sure if the SCEV analysis there can handle the Fortran case.

I have not noticed any measurable increase in compile time - the compile time for Spec 2017 wrf_r stays within 1-2 seconds of what it was, and that's the one that takes about 15 minutes to build in total. I'm not saying it's impossible to come up with something that compiles slowly with this code, but I've made as good an attempt as I can to "exit early" if there's nothing that needs doing, and to only duplicate the innermost loop [which potentially is not the most optimal case].

Great! This sounds good to me.

The problem I'm trying to solve here is that the Loop Vectorizer optimisation doesn't "like" the descriptor accesses used to read the strides [and other descriptor accesses] - it doesn't understand that the value is a constant for a given run of a loop. With versioning, it basically makes the Loop Vectorizer "happy" with the loop, so it can do its own thing.

I think the loop vectorize pass is doing a similar thing. Check the LLVM IR generated for the following Fortran case:

<snip Fortran, LLVM IR, and C code samples quoted from the previous comment>

The loop in function test1 can be vectorized using stride-check loop versioning. I didn't dig into why test2 fails to vectorize the reduction. Anyway, the loop vectorize pass can do some similar work. But I am not sure if the SCEV analysis there can handle the Fortran case.

Yes, I've only spent SOME time trying to understand why the vectorizer doesn't - it basically comes down to "can't figure out how the stride is calculated, and whether it may change over time". The loop versioning here is helping to solve that for the lower layers in the compiler - and in my experience (I've written quite a few different style loops and such), either some MLIR pass(es) manages to vectorize the code, or the LLVM loop vectorizer can't do it either. [Why are there two layers of vectorizers? I don't know - presumably people like to write code that does similar things on multiple levels].

Also, the overall goal isn't ONLY to vectorize, but also to "improve overall loop performance". It may be that we can convince the compiler to do a better job in some other way, but right now, I'm not aware of what that should be. The fundamental problem is that the descriptors used to pass array information are getting in the way of lots of things. It should be noted that although we see a performance improvement in roms_r, it does NOT come from vectorization of the loop constructs - both gfortran and flang use loop versioning, and get a similar performance improvement - and neither compiler uses vector instructions in the hot loops in ROMS.

Also, if you do something like:

#include <cstddef>   // for std::size_t

void temp(double *sum);

template <typename T>
struct descr
{
  std::size_t stride;
  T *ptr;
};

// "restrict" is not standard C++; __restrict__ is the usual compiler extension.
void sum1d(descr<double> *__restrict__ a, int n)
{
  double sum = 0.0;
  for (int i = 0; i < n; ++i)
    sum += a->ptr[i * a->stride];
  temp(&sum);
}

The compiler (clang and gcc both) will generate "different" code than when plainly passing a pointer + stride. For simple cases, where all the code is "visible", the compiler can inline and propagate things, but if it's compiled separately, it "doesn't know what's going on" and gives up.

LGTM, but someone else should approve. Only comment is that the fir test is currently called loop-versioing.fir. Should probably be loop-versioning.fir?

tblah accepted this revision.Mar 15 2023, 8:29 AM

LGTM, thanks for addressing my comments and answering my questions.

This revision is now accepted and ready to land.Mar 15 2023, 8:29 AM

Yes, I've only spent SOME time trying to understand why the vectorizer doesn't - it basically comes down to "can't figure out how the stride is calculated, and whether it may change over time". The loop versioning here is helping to solve that for the lower layers in the compiler - and in my experience (I've written quite a few different style loops and such), either some MLIR pass(es) manages to vectorize the code, or the LLVM loop vectorizer can't do it either. [Why are there two layers of vectorizers? I don't know - presumably people like to write code that does similar things on multiple levels].

OK.

Also, the overall goal isn't ONLY to vectorize, but also to "improve overall loop performance". It may be that we can convince the compiler to do a better job in some other way, but right now, I'm not aware of what that should be. The fundamental problem is that the descriptors used to pass array information are getting in the way of lots of things. It should be noted that although we see a performance improvement in roms_r, it does NOT come from vectorization of the loop constructs - both gfortran and flang use loop versioning, and get a similar performance improvement - and neither compiler uses vector instructions in the hot loops in ROMS.

I am confused about how the performance gets improved for roms_r, as you mentioned. From the test in this patch, I can only see the loop vectorization opportunity after loop versioning. Maybe another test case extracted from roms_r, showing how it gets the performance improvement, would be better.

BTW, can you attach the link to the discourse discussion in the commit message?

FYI, another related scenario is https://github.com/llvm/llvm-project/issues/59388.

I also faced one case with a loop interchange opportunity, which is blocked because the Fortran array stride information is not understood by SCEV analysis. I think this is a common issue in loop optimization. So there may need to be some discussion about whether loop versioning in an MLIR pass is the right solution.

FYI, another related scenario is https://github.com/llvm/llvm-project/issues/59388.

I also faced one case with a loop interchange opportunity, which is blocked because the Fortran array stride information is not understood by SCEV analysis. I think this is a common issue in loop optimization. So there may need to be some discussion about whether loop versioning in an MLIR pass is the right solution.

I compiled the code in the ticket with loop-versioning, which produces this code for the "version where the stride matches":

.LBB0_17:
	movslq	%r13d, %r13
	leaq	(%rdx,%r13), %r12
	decq	%r12
	movups	(%r9,%r12,4), %xmm0
	leaq	-1(%r13,%r10), %r12
	movups	(%r14,%r12,4), %xmm1
	leaq	-1(%r13,%r11), %r12
	movups	(%r15,%r12,4), %xmm2
	mulps	%xmm1, %xmm2
	addps	%xmm0, %xmm2
	leaq	(%rbx,%r13), %r12
	decq	%r12
	movups	%xmm2, (%r8,%r12,4)
	addl	$4, %r13d
	addq	$-4, %rsi
	jne	.LBB0_17

So, that's using vector instructions.

With loop versioning turned off:

.LBB0_4:
	movslq	%r11d, %r11
	leaq	-1(%r11), %rax
	movq	%r15, %rdx
	imulq	%rax, %rdx
	addq	%r10, %rdx
	movss	(%rbp,%rdx), %xmm0
	movq	%rsi, %rdx
	imulq	%rax, %rdx
	addq	%rdi, %rdx
	mulss	(%rcx,%rdx), %xmm0
	movq	-16(%rsp), %rdx
	imulq	%rax, %rdx
	addq	%r13, %rdx
	movq	-8(%rsp), %r12
	addss	(%r12,%rdx), %xmm0
	imulq	%r8, %rax
	addq	%r14, %rax
	movss	%xmm0, (%r9,%rax)
	incl	%r11d
	decq	%rbx
	cmpq	$1, %rbx
	ja	.LBB0_4

This is also a good example of how the code for the array accesses is simplified - there are no complicated index calculations in the loop. [This is partly because the array is flattened into a 1D array, and partly because we give the compiler a clearer idea of what is going on]. Just a basic "count the number of instructions" shows that the "versioned" loop has 16 instructions, vs 22 in the non-versioned loop. I haven't counted clock cycles per instruction, but just eyeballing it, I'd say it's even more of a difference - and of course, since it was successfully vectorized, it's ALSO faster because it runs a quarter as many loop iterations.

Yes, I've only spent SOME time trying to understand why the vectorizer doesn't - it basically comes down to "can't figure out how the stride is calculated, and whether it may change over time". The loop versioning here is helping to solve that for the lower layers in the compiler - and in my experience (I've written quite a few different style loops and such), either some MLIR pass(es) manages to vectorize the code, or the LLVM loop vectorizer can't do it either. [Why are there two layers of vectorizers? I don't know - presumably people like to write code that does similar things on multiple levels].

OK.

Also, the overall goal isn't ONLY to vectorize, but also to "improve overall loop performance". It may be that we can convince the compiler to do a better job in some other way, but right now, I'm not aware of what that should be. The fundamental problem is that the descriptors used to pass array information are getting in the way of lots of things. It should be noted that although we see a performance improvement in roms_r, it does NOT come from vectorization of the loop constructs - both gfortran and flang use loop versioning, and get a similar performance improvement - and neither compiler uses vector instructions in the hot loops in ROMS.

I am confused about how the performance gets improved for roms_r, as you mentioned. From the test in this patch, I can only see the loop vectorization opportunity after loop versioning. Maybe another test case extracted from roms_r, showing how it gets the performance improvement, would be better.

Because the access to a 1D !fir.ref<!fir.array<*>> is simpler than the access through a !fir.box<!fir.array<*>>. See the example in the comment with a link to the "related ticket".

BTW, can you attach the link to the discourse discussion in the commit message?

Will do.

Thanks @MatsPetersson for working on this. I have started going through the patch and have some comments.

flang/include/flang/Tools/CLOptions.inc
185

Nit: No need for braces here.

flang/lib/Optimizer/Transforms/LoopVersioning.cpp
2–3

Nit: This style probably does not match the usual one.

13

Nit: I think you mean "looks for loops iterating over assumed-shape arrays".

79

Nit: spell out auto.

81

Nit: Could you provide some documentation for what this structure is for?

84

Nit: Why 6?

89

Nit: spell out auto.

96

Nit: Add a comment for the rank 3 requirement.

213

Nit: Add a comment on why reduction to single index is required. If any issues were faced when retaining the multiple dimensions, then please call out.

245
flang/test/Transforms/loop-versioing.fir
76 ↗(On Diff #503460)
116 ↗(On Diff #503460)

Nit: newline

flang/include/flang/Optimizer/Transforms/Passes.td
263
flang/lib/Optimizer/Transforms/LoopVersioning.cpp
64

Nit: spell out auto.

108

Nit: Should this be "Too many dimensions" or "Unsupported type"?

124

Nit: Is this assert really required?

126

Nit: expand auto.

129

Do you mean op2->getParentOfType<fir::DoLoopOp>() is an inner loop contained in loop?

136

Is there a better data structure for argsOfInterest? Like a set or something?

156

Nit: Please be more specific in the comment.

169

Nit: Add a comment saying something like "we have to ensure that all the assumed shape arguments used in the loop have unit-stride, so create a predicate that checks the stride of all of them".

172

Nit: expand auto

186–190

Nit: Braces required here.

217–231

Is using the mlir::indexType necessary here (and probably in other places)? I see more converts in the simplified version than in the original one.

245

Nit: The message in the assertion is too generic. Please replace it with something specific to the transformation, like "did not find any coordinate ops".

I failed to apply this patch locally.

I failed to apply this patch locally.

I think it needs rebasing. I will do that today, but I'm not sure whether I can fix all the comments from Kiran before I go on leave for two weeks. Most are trivial, but they still take time (and I've got a bug that I've been working on for the past week that is higher priority).

I failed to apply this patch locally.

I have updated to the latest llvm/main branch, so should apply now.

Note that I have intentionally NOT fixed any of the review comments, I didn't want to fix a small number and leave the rest unresolved. I will fix that when I'm back from my break.

peixin added inline comments.Mar 20 2023, 1:44 AM
flang/include/flang/Tools/CLOptions.inc
223–224

I still failed to apply this patch. There is no such code in the current main branch.

The loop versioning is inefficient for the following scenario, in which not all arrays in the loop are contiguous. Maybe there should be a cost model for deciding whether to perform the loop versioning, refined based on cases from real workloads?

subroutine vadd(c, a, b, n)
  implicit none
  real :: c(:), a(:), b(:)
  integer :: i, n

  print *, is_contiguous(b)  ! classic-flang is wrong here
  do i = 1, n
    c(i) = 5. * a(i) + 3. * b(i)
    c(i) = c(i) * a(i) + a(i) * 4.
  end do
end

interface
  subroutine vadd(c, a, b, n)
    implicit none
    integer :: n
    real :: c(:), a(:), b(:)
  end
end interface

  integer, parameter :: ub = 10
  real :: a(ub) = (/ (i, i=1, 10) /)
  real :: b(ub*2) = (/ (i, i=1, 20) /)
  real :: c(ub) = 0.
  call vadd(c(1:ub), a(1:ub), b(1:ub:2), 10)
  print *, c
end

BTW, classic-flang uses copy-in/copy-out in the caller for assumed-shape arrays, and all the strides are removed in the callee.

The loop versioning is inefficient for the following scenario, in which not all arrays in the loop are contiguous. Maybe there should be a cost model for deciding whether to perform the loop versioning, refined based on cases from real workloads?

<snip code sample>

BTW, classic-flang uses copy-in/copy-out in the caller for assumed-shape arrays, and all the strides are removed in the callee.

Thanks @peixin for the comment. Mats is away and will only be back in April.
We discussed what other compilers do in https://discourse.llvm.org/t/transformations-to-aid-optimizer-for-subroutines-functions-with-assumed-shape-arguments/66447. As you can imagine and as pointed out in one of the comments in the discourse thread, always creating copies only benefits if the called subroutine can take advantage of it and can amortise the cost of creating the copy.

Thanks @peixin for the comment. Mats is away and will only be back in April.

OK. We can continue the discussion when Mats is back.

Leporacanthicus marked 12 inline comments as done.
  • Rebase
  • Fix review comments
Leporacanthicus added inline comments.Apr 4 2023, 5:24 AM
flang/lib/Optimizer/Transforms/LoopVersioning.cpp
108

Probably "both". :)

136

Maybe. I don't expect there to be MANY of these, and we would have to declare comparison operators for the argInfo to make that work. I'm not convinced using llvm::SmallSet (or mlir::SmallSet) is that useful.

186–190

Not sure if I put braces in the right place, but I've added some... :)

217–231

Without the converts, the code won't compile correctly - it gives a weird error message about cannot be converted to LLVM IR: missing LLVMTranslationDialectInterface registration for dialect for op: <something>

It should only add conversion if the types actually need converting. I guess it would be possible to try to reduce the conversions, but it would eventually need conversion to index type anyway.

The loop versioning is inefficient for the following scenario, in which not all arrays in the loop are contiguous. Maybe there should be a cost model for deciding whether to perform the loop versioning, refined based on cases from real workloads?

<snip code sample>

BTW, classic-flang uses copy-in/copy-out in the caller for assumed-shape arrays, and all the strides are removed in the callee.

So, I will have a closer look at this, but IF the call was made with a contiguous array, it would benefit a fair amount, since it vectorizes the loop - I will try to do a compare "with and without" benchmark thing.

flang/include/flang/Tools/CLOptions.inc
223–224

I'm not sure exactly which part you are referring to "not in main" - there is a dependent patch in D141307 that introduces the loopVersioning option itself - is that what's missing, or am I completely missing something?

Reference this:
https://github.com/llvm/llvm-project/blob/main/flang/include/flang/Tools/CLOptions.inc#L285

flang/test/Transforms/loop-versioing.fir
76 ↗(On Diff #503460)
116 ↗(On Diff #503460)

Doh, missed the test comments. Have fixed locally, will update later.

SBallantyne added inline comments.Apr 4 2023, 8:56 AM
flang/test/Transforms/loop-versioing.fir
1 ↗(On Diff #510757)

Nit: Typo, the filename should be loop-versioning.fir

peixin added a comment.Apr 4 2023, 9:00 AM

So, I will have a closer look at this, but IF the call was made with a contiguous array, it would benefit a fair amount, since it vectorizes the loop - I will try to do a compare "with and without" benchmark thing.

Sorry. This case should be a special case, and can be future work.

This work is a good start, and I am inspired by it. Thanks.

peixin added a comment.Apr 4 2023, 9:06 AM

Strange. One of my inline comments is lost.

peixin added inline comments.Apr 4 2023, 9:16 AM
flang/lib/Optimizer/Transforms/LoopVersioning.cpp
13

I think there should be more description of the purpose, considerations, and TODOs.

  1. This pass actually does two things: preparing for loop vectorization, and expression specialization where the expression is the array stride.

For loop vectorization, I tried to extend the LLVM LoopVectorize pass in D147539 to handle this case and pointer arrays, which contains the cost model and legality checks. I think also doing it here is good. Both methods need more investigation and work in the future.
For expression specialization, extending to more dimensions and other arrays should be future work. The unit stride should be common in Fortran.

  2. There should be a cost model in the future, which can be described as a TODO. I tried this case (https://gcc.godbolt.org/z/Tej6GGvzd): if there is a non-intrinsic call in the loop, gfortran does not do the loop versioning. If there is an if statement, gfortran does it. It seems that gfortran does not perform the loop versioning unconditionally.

  3. The code size problem should also be considered. What about the options -Os/-Oz? What about if the loop is very large?

  4. Considering that loop vectorization can be handled in the LLVM LoopVectorize pass, do you consider doing the loop versioning at the end of the LLVM middle-end optimization passes? If so, classic-flang may also benefit from it, considering LLVM Flang is still not ready for commercial usage.

The loop versioning is inefficient for the following scenario, in which not all arrays in the loop are contiguous. Maybe there should be a cost model for deciding whether to perform the loop versioning, refined based on cases from real workloads?

<snip code sample>

BTW, classic-flang uses copy-in/copy-out in the caller for assumed-shape arrays, and all the strides are removed in the callee.

So, I will have a closer look at this, but IF the call was made with a contiguous array, it would benefit a fair amount, since it vectorizes the loop - I will try to do a compare "with and without" benchmark thing.

I combined my existing benchmark for 1D loop versioning (here: https://discourse.llvm.org/t/rfc-loop-versioning-for-unit-stride/68605/2), so that we run the vadd 250000 times in a loop, with a 4000-element array, with b as unit stride (contiguous) and as 2 * unit stride. With loop versioning, the result when the array is unit stride is 0.33s; with non-unit stride it is 1.2s [for my x86-64 desktop machine]. Without loop versioning, all three variants take about 1.2s. That makes the vectorized (thanks to loop versioning) version 73% faster (or 3.6x faster).

I also ran the same benchmark with a size of 10 instead of 4000 (and correspondingly more loop iterations, to keep the total amount of work comparable). The result is STILL better for the versioned loop, but by much less than for the large array (for many reasons: loop overhead, not being able to use vector instructions for all elements, etc). With loop versioning: 0.76s for unit stride, 1.2s for non-unit stride. Without loop versioning, the result is 1.2s for both scenarios (well, all three cases - but two of those are identical, just passing b and c into the function). For the benchmarks I did, I couldn't get a measurable slow-down for the versioned loop in the non-unit-stride variant - it may be measurable for a different case. [The call does get inlined in this benchmark, something I actively tried to avoid in my original benchmarks].

Here's the initial, 4000-element version of the code:

subroutine vadd(c, a, b, n)
  implicit none
  real :: c(:), a(:), b(:)
  integer :: i, n

  do i = 1, n
    c(i) = 5. * a(i) + 3. * b(i)
    c(i) = c(i) * a(i) + a(i) * 4.
  end do
end subroutine vadd

subroutine do_bench(msg, cc, aa, bb, size)
  interface
     subroutine vadd(c, a, b, n)
       implicit none
       integer :: n
       real :: c(:), a(:), b(:)
     end subroutine vadd
  end interface
  character(*) :: msg
  integer, parameter :: loops = 250000
  integer:: size
  real :: aa(1:)
  real :: bb(1:)
  real :: cc(1:)
    
  real*8 :: time, time_start, time_end
  integer::i

  call CPU_TIME(time_start)
  do i = 1, loops
     call vadd(cc, aa, bb, size)
  end do
  call CPU_TIME(time_end)
  time = time_end - time_start
  print "(A12, F8.5, A2)", msg, time, " s"
end subroutine do_bench

interface
  subroutine vadd(c, a, b, n)
    implicit none
    integer :: n
    real :: c(:), a(:), b(:)
  end subroutine vadd

  subroutine do_bench(msg, cc, aa, bb, size)
    character(*) :: msg
    integer:: size
    real :: aa(1:)
    real :: bb(1:)
    real :: cc(1:)
  end
end interface

  integer, parameter :: ub = 10
  real :: a(ub) = (/ (i, i=1, 10) /)
  real :: b(ub*2) = (/ (i, i=1, 20) /)
  real :: c(ub) = 0.

  integer, parameter :: size = 4000
  real :: aa(size)
  real :: bb(size)
  real :: cc(size * 2)
  real :: res(size)
  
  call vadd(c(1:ub), a(1:ub), b(1:ub:2), 10)
  print *, c

  aa = 1
  bb = 2
  cc = 3
  res = 0

  call do_bench("a + b", res, aa, bb, size)
  call do_bench("a + c", res, aa, cc, size)
  call do_bench("a + c(::2)", res, aa, cc(::2), size)
  
end

For the 10-element version, just replace loops = 250000 with loops = 250000 * 400, and size = 4000 with size = 10. The number of loops is chosen to give a "reasonable amount of time on my machine" - I try to aim for benchmarks of this kind that run around 0.5-1.5s, and I just used the values from my original benchmark to start with.

Yes, the vadd code is definitely quite a bit longer - about 3.4 times (428 vs 126 bytes) for my x86-64 code - this will clearly vary depending on the processor architecture, so Arm may get a different result. Much of that is due to the fact that the loop also gets vectorized, which expands the original code a fair bit, because THAT also has to include the code to resolve the last few elements if the size isn't a multiple of vector size. Being able to vectorize loops is definitely one of the goals here.

As with pretty much any optimisation, not everything will be positive. In this case: larger code size, and possibly also some cases of "marginally slower". I've not found any case where it is actually slower, but I'm sure there are cases where this happens. [A couple of years ago, I was working on a hobby project compiler, where I called a runtime function to count the number of bits "set" in an array of i32 elements, and the compiler decided that "oh, you probably do this on large arrays, so I'll vectorize the loop" - for the application that I was working on, the size of the array was always 1 - so it was badly punished by the "check if we can use vector instructions". I can't remember if I tried to fix the function, or went straight to "form the correct code inline instead" - either way, that project now has an inline count-bits].

(I have seen the reply to my previous comment, but since I'd almost written all this already, I completed it)

Leporacanthicus marked 4 inline comments as done.
  • Change wording in several comments.
  • Change one "auto" to real type.
  • Rename misspelled filename and add newline at the end of the file
Leporacanthicus marked an inline comment as done.Apr 6 2023, 7:49 AM
Leporacanthicus added inline comments.
flang/lib/Optimizer/Transforms/LoopVersioning.cpp
13

Thanks a lot for your in-depth comments. I have tried my best to answer below, and have updated some of the explanatory comments in the pass.

I think there should be more description of the purpose, considerations, and TODOs.

  1. This pass actually does two things: preparing for loop vectorization, and expression specialization where the expression is the array stride.

I have added more text to the description of the pass in general, describing some of what we've discussed in various comments on this review, which should make it clearer what the pass does, why it is needed, etc.

For loop vectorization, I tried to extend the LLVM LoopVectorize pass in D147539 to handle this case and pointer arrays, which contains the cost model and legality checks. I think also doing it here is good. Both methods need more investigation and work in the future.

From what I can tell, the problem is that the loop vectorizer doesn't understand how Fortran descriptors work, and just sees them as structures that contain some data used by the loop - which makes it decide "I have no idea what the stride is, so I can't vectorize this". I looked into solving the loop vectorizer problem (but not VERY hard), and I couldn't get my head around how you'd figure out when to try to do this and when not. In FIR, we can know this is a descriptor for an array.

  2. There should be a cost model in the future, which can be described as a TODO.

Thanks, I have added a TODO to that effect.

I tried this case (https://gcc.godbolt.org/z/Tej6GGvzd): if there is a non-intrinsic call in the loop, gfortran does not do the loop versioning. If there is an if statement, gfortran does it. It seems that gfortran does not perform the loop versioning unconditionally.

I'm not entirely sure why you wouldn't do this just because there is a call in there. Do you know why it doesn't? Of course, if someone is calling, say, print *, a (where a is a 1000+ element array), it's probably not a great idea to transform that loop.

I have added a TODO to this effect.
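
For reference, a minimal sketch of the kind of loop being discussed (hypothetical routine other_work - not code from the patch or from the godbolt link):

subroutine scale_and_call(a, n)
  interface
    subroutine other_work(x)
      real :: x
    end subroutine other_work
  end interface
  real :: a(:)
  integer :: n, i
  do i = 1, n
    a(i) = 2.0 * a(i)
    call other_work(a(i))   ! non-intrinsic call in the loop body - the case
                            ! where gfortran reportedly skips the versioning
  end do
end subroutine scale_and_call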

  3. The code size problem should also be considered. What about the options -Os/-Oz? What about if the loop is very large?

I believe -Os/-Oz are not currently supported in Flang (-Os didn't work when I just tried it, and I'm sure there was a discourse thread on it just a few days back), but the code for that (in D141307, file clang/lib/Driver/ToolChains/Flang.cpp lines 69-76) should return false for "not a number". It is only enabled for > O2 [along with -Ofast] - or if someone specifically asks for it with -fversion-loops-for-stride.

A limit for large loops is probably not a bad thing. Where to set that limit, I haven't got much of an idea...

  4. Considering that loop vectorization can be handled in the LLVM LoopVectorize pass, do you consider doing the loop versioning at the end of the LLVM middle-end optimization passes? If so, classic-flang may also benefit from it, considering LLVM Flang is still not ready for commercial usage.

As discussed above, it comes down to the problem of "understanding what the code does".

Classic flang relies on the "copy to a contiguous array", so there wouldn't be any benefit [unless we remove that part - I have no idea how easy/hard it would be to do that].

129

Rewritten the comment to make it clearer.

Update moved header file and include kindmapping that we rely on.

Add size to SmallVector to allow Windows build.

peixin added a comment.Apr 7 2023, 2:20 AM

Thanks for the update. The whole design looks good to me now. Thanks for the work.

flang/include/flang/Optimizer/Transforms/Passes.td
263

nit: American English is suggested. You used it in LoopVersioning.cpp.

265

American English is suggested.

flang/lib/Optimizer/Transforms/LoopVersioning.cpp
2

nit

17
27

Is this an editor problem? Weird indentation, also for the following.

29
Leporacanthicus marked 4 inline comments as done.
  • Review comment updates. All are comment wording and formatting, no actual code changes
  • Rebase, which includes trivial conflict in Passes.td
flang/include/flang/Optimizer/Transforms/Passes.td
263

Having spent far too much time reading and writing a mix of British and American English, I never quite know which I'm actually using - or should use... ;)

flang/lib/Optimizer/Transforms/LoopVersioning.cpp
2

I removed the brief description - there's an actual description further down, and I don't think there's much point in having a one-line [that isn't quite one line] description too. There is a mixture of the two approaches, but I think reading the long comment 10 lines further down is good enough.

2–3

I have changed it, removing the "too long" description altogether, see above.

27

No, it's me running clang-format and not spotting that it's modified some comments to not look good. ;)

[I guess it would be POSSIBLE to set the editor to break lines too! Or to run clang-format on a region]

Leporacanthicus marked 5 inline comments as done.Apr 14 2023, 10:00 AM
Leporacanthicus added inline comments.
flang/lib/Optimizer/Transforms/LoopVersioning.cpp
195–198

Not done this.

Leporacanthicus marked 10 inline comments as done.Apr 14 2023, 10:14 AM
Leporacanthicus added inline comments.
flang/lib/Optimizer/Transforms/LoopVersioning.cpp
136

I considered this, but you at least need a custom operator< for that to work, and I don't think it provides a real benefit.

Fix unnecessary curly braces

LGTM. Thanks for working on this and addressing all the comments.

I am requesting further tests, nits, and also some additional checks to avoid failures when loops are generated.

flang/lib/Optimizer/Transforms/LoopVersioning.cpp
37
40
155

Nit: Would loopsOfInterest be a better name?

160

Nit: Why is this called op2 and not just op?

203–226

Nit: Please add a test for comparing the dimension of more than one array.

248

Nit: we need a test for the multi-dimension to single dimension reduction.

285

Nit: Can we add a comment for why this ResultOP is needed here and below? Propagate the results from the If?

I think there are cases where there are no results in the loop; it might not be correct to insert one here in that case. This happens in cases where the loop is generated.

subroutine test3(x, y)
  integer :: y(:)
  integer :: x(:)
  read(*,*) x(y)
end subroutine
288

Nit: Could you check the following test?

subroutine test4(a, b, n1, m1)
  real :: a(:)  
  real :: b(:,:)
          
  a = [ ((b(i,j), j=1,n1,m1), i=1,n1,m1) ]
end subroutine test4
flang/test/Transforms/loop-versioning.fir
109–113 ↗(On Diff #513714)

Nit: Would it be better to check that there are not two fir.do_loop ops?

Leporacanthicus marked 4 inline comments as done.

Review comment changes:

  • Add checks for "no result" from the operations
  • Fix code generated for array initializers
  • Change the names of op2 -> op, and opsOfInterest -> loopsOfInterest
  • Some typos in comments fixed
Leporacanthicus marked 2 inline comments as done.Apr 17 2023, 2:24 PM
Leporacanthicus marked an inline comment as done.Apr 17 2023, 2:37 PM
This revision was automatically updated to reflect the committed changes.