Page MenuHomePhabricator

Add loop-versioning pass to improve unit-stride
ClosedPublic

Authored by Leporacanthicus on Jan 9 2023, 10:25 AM.

Details

Summary

Introduce conditional code to identify stride of "one element", and simplify the array accesses for that case.

This allows better loop performance in various benchmarks.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
flang/include/flang/Optimizer/Transforms/Passes.td
292
flang/lib/Optimizer/Transforms/LoopVersioning.cpp
64

Nit: spell out auto.

108

Nit: Should this be "Too many dimensions" or "Unsupported type"?

124

Nit: Is this assert really required?

126

Nit: expand auto.

129

Do you mean op2->getParentOfType<fir::DoLoopOp>() is an inner loop contained in loop?

136

Is there a better datastructure for argsOfInterest? like a set or something?

156

Nit: Please be more specific in the comment.

169

Nit: Add a comment saying something like "we have to ensure that all the assumed shape arguments used in the loop have unit-stride, so create a predicate that checks the stride of all of them".

172

Nit: expand auto

186–190

Nit: Braces required here.

217–231

Is using the mlir::indexType necessary here (and probably in other places)? I see more converts in the simplified version that in the original one.

245

Nit:The message in the assertion is too generic. Please replace with something specific to the transformation, like did not find any co-ordinate ops.

I failed to apply this patch locally.

I failed to apply this patch locally.

I think it needs rebasing. I will do that today, but I'm not sure whether I can fix all the comments from Kiran before I go on leave for two weeks. Most are trivial, but still takes time (and I've got a bug that I've been working on the past week that is higher priority).

I failed to apply this patch locally.

I have updated to the latest llvm/main branch, so should apply now.

Note that I have intentionally NOT fixed any of the review comments, I didn't want to fix a small number and leave the rest unresolved. I will fix that when I'm back from my break.

peixin added inline comments.Mar 20 2023, 1:44 AM
flang/include/flang/Tools/CLOptions.inc
285–289

I still failed to apply this patch. There is no such code in current main branch.

The loop versioning is inefficient for the following scenario, in which not all arrays are contiguous in a loop. Maybe there should be one cost model to perform the loop versioning, and be refined driven by the cases in real workloads?

subroutine vadd(c, a, b, n)
  implicit none
  real :: c(:), a(:), b(:)
  integer :: i, n

  print *, is_contiguous(b)  ! classic-flang is wrong here
  do i = 1, n
    c(i) = 5. * a(i) + 3. * b(i)
    c(i) = c(i) * a(i) + a(i) * 4.
  end do
end

interface
  subroutine vadd(c, a, b, n)
    implicit none
    integer :: n
    real :: c(:), a(:), b(:)
  end
end interface

  integer, parameter :: ub = 10
  real :: a(ub) = (/ (i, i=1, 10) /)
  real :: b(ub*2) = (/ (i, i=1, 20) /)
  real :: c(ub) = 0.
  call vadd(c(1:ub), a(1:ub), b(1:ub:2), 10)
  print *, c
end

BTW, classic-flang use copy-in-copy-out in caller for assumed-shape array, and all the strides are removed in the callee.

The loop versioning is inefficient for the following scenario, in which not all arrays are contiguous in a loop. Maybe there should be one cost model to perform the loop versioning, and be refined driven by the cases in real workloads?

subroutine vadd(c, a, b, n)
  implicit none
  real :: c(:), a(:), b(:)
  integer :: i, n

  print *, is_contiguous(b)  ! classic-flang is wrong here
  do i = 1, n
    c(i) = 5. * a(i) + 3. * b(i)
    c(i) = c(i) * a(i) + a(i) * 4.
  end do
end

interface
  subroutine vadd(c, a, b, n)
    implicit none
    integer :: n
    real :: c(:), a(:), b(:)
  end
end interface

  integer, parameter :: ub = 10
  real :: a(ub) = (/ (i, i=1, 10) /)
  real :: b(ub*2) = (/ (i, i=1, 20) /)
  real :: c(ub) = 0.
  call vadd(c(1:ub), a(1:ub), b(1:ub:2), 10)
  print *, c
end

BTW, classic-flang use copy-in-copy-out in caller for assumed-shape array, and all the strides are removed in the callee.

Thanks @peixin for the comment. Mats is away and is only back in April.
We discussed what other compilers do in https://discourse.llvm.org/t/transformations-to-aid-optimizer-for-subroutines-functions-with-assumed-shape-arguments/66447. As you can imagine and as pointed out in one of the comments in the discourse thread, always creating copies only benefits if the called subroutine can take advantage of it and can amortise the cost of creating the copy.

Thanks @peixin for the comment. Mats is away and is only back in April.

OK. We can continue the discussion when Mats is back.

Leporacanthicus marked 12 inline comments as done.
  • Rebase
  • Fix review comments
Leporacanthicus added inline comments.Apr 4 2023, 5:24 AM
flang/lib/Optimizer/Transforms/LoopVersioning.cpp
108

Probably "both". :)

136

Maybe. I don't expect there to be MANY of these, and we would have to declare a comparison operators for the argInfo to make that work. Not convinced using llvm::SmallSet (or mlir::SmallSet is that useful).

186–190

Not sure if I put braces in the right place, but I've added some... :)

217–231

Without the converts, the code won't compile correctly - it gives a weird error message about cannot be converted to LLVM IR: missing LLVMTranslationDialectInterface registration for dialect for op: <something>

It should only add conversion if the types actually need converting. I guess it would be possible to try to reduce the conversions, but it would eventually need conversion to index type anyway.

The loop versioning is inefficient for the following scenario, in which not all arrays are contiguous in a loop. Maybe there should be one cost model to perform the loop versioning, and be refined driven by the cases in real workloads?

subroutine vadd(c, a, b, n)
  implicit none
  real :: c(:), a(:), b(:)
  integer :: i, n

  print *, is_contiguous(b)  ! classic-flang is wrong here
  do i = 1, n
    c(i) = 5. * a(i) + 3. * b(i)
    c(i) = c(i) * a(i) + a(i) * 4.
  end do
end

interface
  subroutine vadd(c, a, b, n)
    implicit none
    integer :: n
    real :: c(:), a(:), b(:)
  end
end interface

  integer, parameter :: ub = 10
  real :: a(ub) = (/ (i, i=1, 10) /)
  real :: b(ub*2) = (/ (i, i=1, 20) /)
  real :: c(ub) = 0.
  call vadd(c(1:ub), a(1:ub), b(1:ub:2), 10)
  print *, c
end

BTW, classic-flang use copy-in-copy-out in caller for assumed-shape array, and all the strides are removed in the callee.

So, I will have a closer look at this, but IF the call was made with a contiguous array, it would benefit a fair amount, since it vectorizes the loop - I will try to do a compare "with and without" benchmark thing.

flang/include/flang/Tools/CLOptions.inc
285–289

I'm not sure exactly which part you are referring to "not in main" - there is a dependent patch in D141307 that introduces the loopVersioning option itself - is that what's missing, or am I completely missing something?

Reference this:
https://github.com/llvm/llvm-project/blob/main/flang/include/flang/Tools/CLOptions.inc#L285

flang/test/Transforms/loop-versioing.fir
76 ↗(On Diff #503460)
116 ↗(On Diff #503460)

Doh, missed the test comments. Have fixed locally, will update later.

SBallantyne added inline comments.Apr 4 2023, 8:56 AM
flang/test/Transforms/loop-versioing.fir
1 ↗(On Diff #510757)

Nit: Typo, filename should loop-versioning.fir

peixin added a comment.Apr 4 2023, 9:00 AM

So, I will have a closer look at this, but IF the call was made with a contiguous array, it would benefit a fair amount, since it vectorizes the loop - I will try to do a compare "with and without" benchmark thing.

Sorry. This case should be special case, and can be the future work.

This work is a good start, and I am inspired by it. Thanks.

peixin added a comment.Apr 4 2023, 9:06 AM

Strange. One of my inlined comment is lost.

peixin added inline comments.Apr 4 2023, 9:16 AM
flang/lib/Optimizer/Transforms/LoopVersioning.cpp
13

I think there should be more descriptions about purpose, considerations, and TODOs
.

  1. This pass actually does two works: prepare for loop vectorize and expression sp

ecialization where the expression is array stride.

For loop vectorize, I tried to extend LLVM LoopVectorize pass in D147539 to handle this case and pointer arrays, which contains the cost model and legality checks. I think also doing it here is good. Both of methods needs more investigations and works in future.
For expression specialization, extending to more dimensions and other arrays sho

uld be future work. The unit stride should be common in Fortran.

  1. There should be one cost model in future, which can be described as TODO. I tri

ed the case (https://gcc.godbolt.org/z/Tej6GGvzd), if there is non-intrinsic call
in the loop, gfortran does not do the loop versioning. If there is an if statement
, gfortran does it. It seems that gfortran does not perform the loop versioning un
conditionally.

  1. The code size problem should also be considered. What about the option -Os/Oz?

What about if the loop is very large?

  1. Considering the loop vectorize can be handled in LLVM LoopVectorize pass, do yo

u consider doing the loop versioning at the end of LLVM middle-end optimization pa
ss? If so, classic-flang may also benefit from it considering LLVM Flang is still
not ready for commercial usage.

The loop versioning is inefficient for the following scenario, in which not all arrays are contiguous in a loop. Maybe there should be one cost model to perform the loop versioning, and be refined driven by the cases in real workloads?

<snip code sample>

BTW, classic-flang use copy-in-copy-out in caller for assumed-shape array, and all the strides are removed in the callee.

So, I will have a closer look at this, but IF the call was made with a contiguous array, it would benefit a fair amount, since it vectorizes the loop - I will try to do a compare "with and without" benchmark thing.

I combined my existing benchmark for 1D loop versioning (here: https://discourse.llvm.org/t/rfc-loop-versioning-for-unit-stride/68605/2), so that we run the vadd 250000 times in a loop, with a 4000 element array size with b as unit-stride (contiguous) and b as 2 * unit stride. The result with loop versioning, when the array is unit-stride is 0.33s, with non-unit stride it is 1.2s [for my x86-64 desktop machine]. Without loop versioning, all three variants take about 1.2s. That makes the vectorized (thanks to loop versioning) version 73% faster (or 3.6x faster).

I also ran the same benchmark, with a size of 10 instead of 4000 (and correspondingly more loops to . The result is STILL better for the versioned loop, but much lower than the large array (for many reasons, loop overhead, not able to use vector instructions for all elements, etc). With loop versioning: 0.76s for unit-stride, 1.2s for non-unit stride. Without loop versioning, the result is 1.2s for both scenarios (well, all three cases - but two of those are identical, just passing b and c into the function). For the benchmarks I did, I couldn't get a measurable slow-down for the versioned loop in the non-unit-stride variant - it may be measurable for a different case. [The call does get inlined in this benchmark, someting I actively tried to avoid in my original benchmarks].

Here's the initial, 4000 element version of the code:

subroutine vadd(c, a, b, n)
  implicit none
  real :: c(:), a(:), b(:)
  integer :: i, n

  do i = 1, n
    c(i) = 5. * a(i) + 3. * b(i)
    c(i) = c(i) * a(i) + a(i) * 4.
  end do
end subroutine vadd

subroutine do_bench(msg, cc, aa, bb, size)
  interface
     subroutine vadd(c, a, b, n)
       implicit none
       integer :: n
       real :: c(:), a(:), b(:)
     end subroutine vadd
  end interface
  character(*) :: msg
  integer, parameter :: loops = 250000
  integer:: size
  real :: aa(1:)
  real :: bb(1:)
  real :: cc(1:)
    
  real*8 :: time, time_start, time_end
  integer::i

  call CPU_TIME(time_start)
  do i = 1, loops
     call vadd(cc, aa, bb, size)
  end do
  call CPU_TIME(time_end)
  time = time_end - time_start
  print "(A12, F8.5, A2)", msg, time, " s"
end subroutine do_bench

interface
  subroutine vadd(c, a, b, n)
    implicit none
    integer :: n
    real :: c(:), a(:), b(:)
  end subroutine vadd

  subroutine do_bench(msg, cc, aa, bb, size)
    character(*) :: msg
    integer:: size
    real :: aa(1:)
    real :: bb(1:)
    real :: cc(1:)
  end
end interface

  integer, parameter :: ub = 10
  real :: a(ub) = (/ (i, i=1, 10) /)
  real :: b(ub*2) = (/ (i, i=1, 20) /)
  real :: c(ub) = 0.

  integer, parameter :: size = 4000
  real :: aa(size)
  real :: bb(size)
  real :: cc(size * 2)
  real :: res(size)
  
  call vadd(c(1:ub), a(1:ub), b(1:ub:2), 10)
  print *, c

  aa = 1
  bb = 2
  cc = 3
  res = 0

  call do_bench("a + b", res, aa, bb, size)
  call do_bench("a + c", res, aa, cc, size)
  call do_bench("a + c(::2)", res, aa, cc(::2), size)
  
end

For the 10 element version, just replace the looops = 250000 with loops = 250000 * 400, and size = 4000 with size = 10. Number of loops is choosen to make a "reasonable amount of time on my machine" - I try to aim for benchmarks of this kind that run around 0.5-1.5s, and I just used the values from my original benchmark to start with,.

Yes, the vadd code is definitely quite a bit longer - about 3.4 times (428 vs 126 bytes) for my x86-64 code - this will clearly vary depending on the processor architecture, so Arm may get a different result. Much of that is due to the fact that the loop also gets vectorized, which expands the original code a fair bit, because THAT also has to include the code to resolve the last few elements if the size isn't a multiple of vector size. Being able to vectorize loops is definitely one of the goals here.

As with pretty much any optimisation, not everything will be positive. In this case, larger code-size, and possibly also some cases of "marginally slower". I've not found any case where it is actually slower, but I'm sure there are cases where this happens. [A couple of years ago, I was working on a hobby project compiler, where I called a runtime function to count number of bits "set" in an array of i32 elements, which, where the compiler decided that "oh, you probably do this on large arrays, so I'll vectorize the loop" - for the application that I was working on, the size of the array was always 1 - so it was bady punished by the "check if we can use vector instructions" - I can't remember if I tried to fix the function, or went straight to "form the correct code inline instead" - now that project has an inline count bits, either way].

(I have seen the reply to my previous comment, but since I'd almost written all this already, I completed it)

Leporacanthicus marked 4 inline comments as done.
  • Change wording in several comments.
  • Change one "auto" to real type.
  • Rename misspelled filename and add newline at the end of the file
Leporacanthicus marked an inline comment as done.Apr 6 2023, 7:49 AM
Leporacanthicus added inline comments.
flang/lib/Optimizer/Transforms/LoopVersioning.cpp
13

Thanks a lot for your in depth comments. I have tried my best to answer below, and updated some of the comment explanation to the pass.

I think there should be more descriptions about purpose, considerations, and TODOs
.

  1. This pass actually does two works: prepare for loop vectorize and expression specialization where the expression is array stride.

I have added more text t to the description of the pass in general, describing some of what we've discussed in various comments on this review, which should make it clearer what the pass does, why it is needed, etc.

For loop vectorize, I tried to extend LLVM LoopVectorize pass in D147539 to handle this case and pointer arrays, which contains the cost model and legality checks. I think also doing it here is good. Both of methods needs more investigations and works in future.

From what I can tell, the problem is that loop vectorizer doesn't understand how Fortran descriptors work, and just sees them as structures that contain some data used by the loop - which makes it decide that "I have no idea what the stride is, so I can't vectorize this". I looked into, but not VERY hard, solving the loop vectorizer problem, but I couldn't get my head around how you'd figure out when to try to do this and when not. In FIR, we can know this is a descriptor for an array.

  1. There should be one cost model in future, which can be described as TODO.

Thanks, I have added a TODO to that effect.

I tried the case (https://gcc.godbolt.org/z/Tej6GGvzd), if there is non-intrinsic call
in the loop, gfortran does not do the loop versioning. If there is an if statement, gfortran does it. It seems that gfortran does not perform the loop versioning unconditionally.

I'm not entirely sure why you wouldn't do this just because there is a call in there. Do you know why it doesn't? Of course, if someone is calling, say, print *, a (where a is a 1000+ element array), it's probably not a great ideal to transform that loop.

I have added a TODO to this effect.

  1. The code size problem should also be considered. What about the option -Os/Oz?

What about if the loop is very large?

I believe -Os/-Oz is not currently supported in Flang (-Os didn't work when I just tried it, and I'm sure there was a discourse thread on it just a few days back), but the code for that (in D141307, file clang/lib/Driver/ToolChains/Flang.cpp lines 69-76) should return false for "not a number". It is only enabled for > O2 [along with -Ofast] - or if someone specifically asks for the -fversion-loops-for-stride.

A limit for large loops is probably not a bad thing. Where to set that limit, I haven't got much of an idea...

  1. Considering the loop vectorize can be handled in LLVM LoopVectorize pass, do you consider doing the loop versioning at the end of LLVM middle-end optimization pass? If so, classic-flang may also benefit from it considering LLVM Flang is still not ready for commercial usage.

As discussed above, it's the problem of "understanding what the code does" that I talked about above.

Classic flang relies on the "copy to a contiguous array", so there wouldn't be any benefit [unless we remove that part - I have no idea how easy/hard it would be to do that].

129

Rewritten the comment to make it clearer.

Update moved header file and include kindmapping that we rely on.

Add size to SmallVector to allow Windows build.

peixin added a comment.Apr 7 2023, 2:20 AM

Thanks for the update. The whole design looks good to me now. Thanks for the work.

flang/include/flang/Optimizer/Transforms/Passes.td
292

nit: American english is suggested. You used it in LoopVersioning.cpp.

294

American english is suggested.

flang/lib/Optimizer/Transforms/LoopVersioning.cpp
2

nit

17
27

Is this one editor problem? wield indentation, also for the following.

29
Leporacanthicus marked 4 inline comments as done.
  • Review comment updates. All are comment wording and formatting, no actual code changes
  • Rebase, which includes trivial conflict in Passes.td
flang/include/flang/Optimizer/Transforms/Passes.td
292

Having spent far too much time reading and writing a mix of British and American English, I never quite know which I'm actually using - or should use... ;)

flang/lib/Optimizer/Transforms/LoopVersioning.cpp
2

I removed the brief description - there's an actual description further down, and I don't think there's much point in having a one-line [that isn't quite one line] description too. There is a mixture of the two approaches, but I think reading the long comment 10 lines further down is good enough.

2–3

I have changed it, removing the "too long" description altogether, see above.

27

No, it's me running clang-format and not spotting that it's modified some comments to not look good. ;)

[I guess it would be POSSIBLE to set the editor to break lines too! Or to run clang-format on a region]

Leporacanthicus marked 5 inline comments as done.Apr 14 2023, 10:00 AM
Leporacanthicus added inline comments.
flang/lib/Optimizer/Transforms/LoopVersioning.cpp
195–198

Not done this.

Leporacanthicus marked 10 inline comments as done.Apr 14 2023, 10:14 AM
Leporacanthicus added inline comments.
flang/lib/Optimizer/Transforms/LoopVersioning.cpp
136

I considered this, but you at least need a custom operator< for that to work, and I don't think it provides a real benefit.

Fix unnecessary curly braces

LGTM. Thanks for working on this and addressing all the comments.

I am requesting further tests, Nits and also some additional checks to avoid failures when loops are generated.

flang/lib/Optimizer/Transforms/LoopVersioning.cpp
37
40
155

Nit: Would loopsOfInterest be a better name?

160

Nit: Why is this called op2 and not just op?

203–226

Nit: Please add a test for comparing the dimension of more than one array.

248

Nit: we need a test for the multi-dimension to single dimension reduction.

285

Nit: Can we add a comment for why this ResultOP is needed here and below? Propagate the results from the If?

I think there are cases where there are no results in the loop, it might not be correct to insert one here in that case. This happens in cases where the loop is generated.

subroutine test3(x, y)
  integer :: y(:)
  integer :: x(:)
  read(*,*) x(y)
end subroutine
288

Nit: Could you check the following test?

subroutine test4(a, b, n1, m1)
  real :: a(:)  
  real :: b(:,:)
          
  a = [ ((b(i,j), j=1,n1,m1), i=1,n1,m1) ]
end subroutine test4
flang/test/Transforms/loop-versioning.fir
110–114

Nit: Would it be better to check that there are not two fir.do_loop?

Leporacanthicus marked 4 inline comments as done.

Review comment changes:

  • Add checks for "no result" from the operations
  • Fix code generated for array initializers
  • Change the names of op2 -> op, and opsOfInterest -> loopsOfInterest
  • Some typos in comments fixed
Leporacanthicus marked 2 inline comments as done.Apr 17 2023, 2:24 PM
Leporacanthicus marked an inline comment as done.Apr 17 2023, 2:37 PM
This revision was automatically updated to reflect the committed changes.