Differential D141306
Add loop-versioning pass to improve unit-stride
Authored by Leporacanthicus on Jan 9 2023, 10:25 AM.
Details
Introduce conditional code to identify stride of "one element", and simplify the array accesses for that case. This allows better loop performance in various benchmarks.
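In source terms, the versioning is conceptually similar to the hand-written Fortran sketch below. This is illustrative only: the pass itself operates on FIR and generates the stride check and simplified address computations directly; the target attribute and contiguous pointer here are just a source-level stand-in for that.

! Illustrative sketch only - not what the pass emits; the pass works on FIR.
subroutine scale(a, n)
  implicit none
  real, target :: a(:)              ! target attribute added only for this sketch
  real, pointer, contiguous :: p(:)
  integer :: i, n
  if (is_contiguous(a)) then
    ! Fast version: a contiguous view means the accesses can be lowered as
    ! simple unit-stride loads/stores, which also enables vectorization.
    p => a
    do i = 1, n
      p(i) = p(i) * 2.
    end do
  else
    ! Fallback: the original loop with descriptor-based (possibly strided) accesses.
    do i = 1, n
      a(i) = a(i) * 2.
    end do
  end if
end subroutine scale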
Event Timeline
Comment Actions I think it needs rebasing. I will do that today, but I'm not sure whether I can fix all the comments from Kiran before I go on leave for two weeks. Most are trivial, but they still take time (and I've got a bug that I've been working on for the past week that is higher priority).
Comment Actions I have updated to the latest llvm/main branch, so it should apply now. Note that I have intentionally NOT fixed any of the review comments; I didn't want to fix a small number and leave the rest unresolved. I will fix them when I'm back from my break.
Comment Actions The loop versioning is inefficient for the following scenario, in which not all of the arrays in a loop are contiguous. Maybe there should be a cost model to decide when to perform the loop versioning, which can then be refined based on cases from real workloads?

subroutine vadd(c, a, b, n)
  implicit none
  real :: c(:), a(:), b(:)
  integer :: i, n
  print *, is_contiguous(b) ! classic-flang is wrong here
  do i = 1, n
    c(i) = 5. * a(i) + 3. * b(i)
    c(i) = c(i) * a(i) + a(i) * 4.
  end do
end

interface
  subroutine vadd(c, a, b, n)
    implicit none
    integer :: n
    real :: c(:), a(:), b(:)
  end
end interface
integer, parameter :: ub = 10
real :: a(ub) = (/ (i, i=1, 10) /)
real :: b(ub*2) = (/ (i, i=1, 20) /)
real :: c(ub) = 0.
call vadd(c(1:ub), a(1:ub), b(1:ub:2), 10)
print *, c
end

BTW, classic-flang uses copy-in/copy-out in the caller for assumed-shape arrays, and all the strides are removed in the callee.

Comment Actions Thanks @peixin for the comment. Mats is away and is only back in April.

Comment Actions
OK. We can continue the discussion when Mats is back.
Comment Actions So, I will have a closer look at this, but IF the call was made with a contiguous array, it would benefit a fair amount, since it vectorizes the loop - I will try to do a "with and without" benchmark comparison.
Comment Actions
Sorry. This case should be treated as a special case, and can be future work. This work is a good start, and I am inspired by it. Thanks.
Comment Actions <snip code sample>
I combined this with my existing benchmark for 1D loop versioning (here: https://discourse.llvm.org/t/rfc-loop-versioning-for-unit-stride/68605/2), so that we run vadd 250000 times in a loop, with a 4000-element array, with b as unit stride (contiguous) and as 2 * unit stride. With loop versioning, the result when the array is unit-stride is 0.33s, and with non-unit stride it is 1.2s [on my x86-64 desktop machine]. Without loop versioning, all three variants take about 1.2s. That makes the vectorized (thanks to loop versioning) version about 3.6x faster (a roughly 73% reduction in runtime).

I also ran the same benchmark with a size of 10 instead of 4000 (and correspondingly more loop iterations). The result is STILL better for the versioned loop, but the gain is much smaller than with the large array (for many reasons: loop overhead, not being able to use vector instructions for all elements, etc.). With loop versioning: 0.76s for unit stride, 1.2s for non-unit stride. Without loop versioning, the result is 1.2s for both scenarios (well, all three cases - but two of those are identical, just passing b and c into the function).

For the benchmarks I did, I couldn't get a measurable slow-down for the versioned loop in the non-unit-stride variant - it may be measurable for a different case. [The call does get inlined in this benchmark, something I actively tried to avoid in my original benchmarks.]

Here's the initial, 4000-element version of the code:

subroutine vadd(c, a, b, n)
  implicit none
  real :: c(:), a(:), b(:)
  integer :: i, n
  do i = 1, n
    c(i) = 5. * a(i) + 3. * b(i)
    c(i) = c(i) * a(i) + a(i) * 4.
  end do
end subroutine vadd

subroutine do_bench(msg, cc, aa, bb, size)
  interface
    subroutine vadd(c, a, b, n)
      implicit none
      integer :: n
      real :: c(:), a(:), b(:)
    end subroutine vadd
  end interface
  character(*) :: msg
  integer, parameter :: loops = 250000
  integer :: size
  real :: aa(1:)
  real :: bb(1:)
  real :: cc(1:)
  real*8 :: time, time_start, time_end
  integer :: i
  call CPU_TIME(time_start)
  do i = 1, loops
    call vadd(cc, aa, bb, size)
  end do
  call CPU_TIME(time_end)
  time = time_end - time_start
  print "(A12, F8.5, A2)", msg, time, " s"
end subroutine do_bench

interface
  subroutine vadd(c, a, b, n)
    implicit none
    integer :: n
    real :: c(:), a(:), b(:)
  end subroutine vadd
  subroutine do_bench(msg, cc, aa, bb, size)
    character(*) :: msg
    integer :: size
    real :: aa(1:)
    real :: bb(1:)
    real :: cc(1:)
  end
end interface
integer, parameter :: ub = 10
real :: a(ub) = (/ (i, i=1, 10) /)
real :: b(ub*2) = (/ (i, i=1, 20) /)
real :: c(ub) = 0.
integer, parameter :: size = 4000
real :: aa(size)
real :: bb(size)
real :: cc(size * 2)
real :: res(size)
call vadd(c(1:ub), a(1:ub), b(1:ub:2), 10)
print *, c
aa = 1
bb = 2
cc = 3
res = 0
call do_bench("a + b", res, aa, bb, size)
call do_bench("a + c", res, aa, cc, size)
call do_bench("a + c(::2)", res, aa, cc(::2), size)
end

For the 10-element version, just replace loops = 250000 with loops = 250000 * 400, and size = 4000 with size = 10. The number of loops is chosen to take a "reasonable amount of time on my machine" - I aim for benchmarks of this kind to run for around 0.5-1.5s, and I just used the values from my original benchmark to start with. Yes, the vadd code is definitely quite a bit longer - about 3.4 times (428 vs 126 bytes) for my x86-64 code - this will clearly vary depending on the processor architecture, so Arm may get a different result.
Much of that is due to the fact that the loop also gets vectorized, which expands the original code a fair bit, because THAT also has to include the code to handle the last few elements if the size isn't a multiple of the vector width (see the sketch after this comment). Being able to vectorize loops is definitely one of the goals here.

As with pretty much any optimisation, not everything will be positive - in this case, larger code size, and possibly some cases of "marginally slower". I've not found any case where it is actually slower, but I'm sure such cases exist. [A couple of years ago, I was working on a hobby compiler project, where I called a runtime function to count the number of bits set in an array of i32 elements. The compiler decided "oh, you probably do this on large arrays, so I'll vectorize the loop" - but for the application I was working on, the size of the array was always 1, so it was badly punished by the "check if we can use vector instructions" code. I can't remember whether I tried to fix the function or went straight to generating the correct code inline instead - either way, that project now counts bits inline.]

(I have seen the reply to my previous comment, but since I'd almost written all this already, I completed it)
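For illustration, here is a hand-written Fortran sketch (not actual compiler output) of the shape a vectorized loop typically takes: a main loop over whole groups of elements plus a scalar remainder loop for the tail, which is where much of the extra code size comes from.

! Hand-written sketch, not compiler output: a main loop over groups of 4
! elements plus a scalar remainder loop for the tail when n is not a
! multiple of the group size.
subroutine vadd_sketch(c, a, b, n)
  implicit none
  real :: c(:), a(:), b(:)
  integer :: i, n, nmain
  nmain = n - mod(n, 4)
  do i = 1, nmain, 4
    c(i:i+3) = 5. * a(i:i+3) + 3. * b(i:i+3)   ! whole groups map naturally to vector instructions
  end do
  do i = nmain + 1, n
    c(i) = 5. * a(i) + 3. * b(i)               ! scalar epilogue for the last few elements
  end do
end subroutine vadd_sketch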
Comment Actions Thanks for the update. The whole design looks good to me now. Thanks for the work.
Comment Actions
Comment Actions LGTM. Thanks for working on this and addressing all the comments. I am requesting further tests, nits, and also some additional checks to avoid failures when loops are generated.
Comment Actions Review comment changes: