Differential D141306: Add loop-versioning pass to improve unit-stride
Authored by Leporacanthicus on Jan 9 2023, 10:25 AM.
Details

Introduce conditional code to identify a stride of "one element", and simplify the array accesses for that case. This allows better loop performance in various benchmarks.
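Conceptually, the pass turns one descriptor-driven loop into a guarded pair of loops: a runtime check on the stride selects either a simplified contiguous body or the original one. A minimal C sketch of the idea, with a hypothetical descriptor type (the names and layout here are illustrative, not Flang's actual descriptor nor the pass's actual output):

```c
#include <stddef.h>

struct descr {
    char     *base;        /* raw base address from the array descriptor */
    ptrdiff_t byte_stride; /* distance between consecutive elements, in bytes */
};

double sum1d(const struct descr *a, int n) {
    double sum = 0.0;
    if (a->byte_stride == (ptrdiff_t)sizeof(double)) {
        /* Fast path: the stride is proven to be one element at run time,
           so the access simplifies to plain indexing that downstream
           passes (e.g. the LLVM loop vectorizer) can analyze. */
        const double *p = (const double *)a->base;
        for (int i = 0; i < n; ++i)
            sum += p[i];
    } else {
        /* Original loop: each access scales the index by a stride loaded
           from the descriptor. */
        for (int i = 0; i < n; ++i)
            sum += *(const double *)(a->base + (ptrdiff_t)i * a->byte_stride);
    }
    return sum;
}
```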
Event Timeline

Comment Actions
Hi @Leporacanthicus, thank you for working on this! I wonder if you target any particular benchmark with these changes - can you please post that information in the description? While I was comparing different compilers on SPEC CPU2000, 2006, and 2017, and Polyhedron, I believe I only saw one case where the missing "contiguity" information affected a hotspot - this was in CPU2000/187.facerec, and the estimated improvement was ~14%. This was in graphRoutines.f90::LocalMove, where the 2-level loop nest for the multiplication+reduction could be collapsed and vectorized with unit stride:

```fortran
274 Subroutine LocalMove (Params, Position, Graph, GaborTrafo, &
275                       Similarity, Hops, Sweeps)
    ...
289 Real(4), Intent(In) :: Graph (:, :, :)
290 Real(4), Intent(In) :: GaborTrafo (:, :, :, :)
    ...
392 NewJetSim = SUM (Graph (:, :, IG) * &
393           &     GaborTrafo (:, :, NewX, NewY))
```

I wonder if you found more cases where this optimization may have a significant impact. I also think you want to replace "assumed size array" with "assumed shape array" in the comments, since assumed-size arrays are contiguous by definition (a C analogy of this distinction follows below).

Comment Actions
This is work described here: As mentioned, this (we believe) will benefit the Spec2017FP roms_r benchmark - probably others too. At the moment, it's not really working even for simple cases, so that's a bit of an issue, but hopefully we can get there... :)

Comment Actions
Added "move remaining loop into else branch" and related changes. Still doesn't actually work as intended.

Comment Actions
Some progress, now identifying the correct loops a little better. Support for 2D arrays, but not "any dimensions". This patch does not work correctly; it needs further work.
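To make the assumed-shape vs. assumed-size distinction above concrete, here is a rough C analogy; the struct layout and names are purely illustrative, not Flang's actual descriptor:

```c
#include <stddef.h>

/* An assumed-size dummy (e.g. "real a(*)") is contiguous by definition,
   so it lowers to a bare base pointer and needs no stride check: */
void assumed_size_like(float *a, long n);

/* An assumed-shape dummy (e.g. "real a(:)") may be bound to a strided
   section such as x(1:n:2), so the callee receives a descriptor and the
   element stride is only known at run time: */
struct desc {
    float    *base;
    long      extent;      /* number of elements */
    ptrdiff_t byte_stride; /* may be larger than sizeof(float) */
};
void assumed_shape_like(const struct desc *a);

/* A caller passing every other element, as x(1:19:2) would in Fortran: */
void example(float buf[20]) {
    struct desc every_other = {buf, 10, 2 * sizeof(float)};
    assumed_shape_like(&every_other);
}
```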
Comment Actions
Thanks for implementing this. Why are only function arguments considered? What about unknown-sized arrays from other sources?
Comment Actions
Updated to reflect review comments.
Comment Actions
I am wondering whether we should do the loop versioning in MLIR, since this may increase compilation time in the optimization passes. The loop vectorize pass can already support this kind of versioning in some cases. Check this case: https://github.com/llvm/llvm-project/blob/main/llvm/test/Transforms/LoopVectorize/version-mem-access.ll. It is SCEV analysis combined with a runtime check (https://github.com/llvm/llvm-project/blob/c11c2f5f6548a5303517c89ba6629bbfa7fae0d9/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L1943-L1973). The original support was added for a C example similar to the Fortran code: https://github.com/llvm/llvm-project/commit/c2e9d759f29602021daa26453452928c81adffbb. I am not against this method for now, since supporting this in the loop vectorization pass is not easy work.
Comment Actions
I have not noticed any measurable increase in compile time - SPEC 2017 wrf_r stays within 1-2 seconds of the same build time, and that's the one that takes about 15 minutes to build in total. I'm not saying it's impossible to come up with something that compiles slowly with this code, but I've made as good an attempt as I can to "exit early" if there's nothing that needs doing, and to only duplicate the innermost loop [which potentially is not the most optimal choice].
The problem I'm trying to solve here is that the Loop Vectorizer doesn't "like" the descriptor accesses used to read the strides [and other descriptor fields] - it doesn't understand that the value is a constant for a given run of a loop. With versioning, the pass basically makes the Loop Vectorizer "happy" with the loop, so it can do its own thing.
Comment Actions
Great! This sounds good to me.
I think the loop vectorize pass is doing a similar thing. Check the LLVM IR generated for the following Fortran case:

```fortran
subroutine sum1d(a, n)
  real*8 :: a(:)
  integer :: n
  real*8 :: sum
  integer :: i
  sum = 0
  do i=1,n
    sum = sum + a(i)
  end do
  call temp(sum)
end subroutine sum1d
```

```
$ flang-new -fc1 -emit-llvm -O3 test.f90
$ cat test.ll
```

```llvm
define void @sum1d_(ptr nocapture readonly %0, ptr nocapture readonly %1) local_unnamed_addr {
  %3 = alloca double, align 8
  store double 0.000000e+00, ptr %3, align 8, !tbaa !1
  %4 = load i32, ptr %1, align 4, !tbaa !1
  %5 = icmp sgt i32 %4, 0
  br i1 %5, label %.lr.ph, label %._crit_edge

.lr.ph:                                           ; preds = %2
  %6 = zext i32 %4 to i64
  %7 = load ptr, ptr %0, align 8, !tbaa !5
  %8 = getelementptr { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, ptr %0, i64 0, i32 7, i64 0, i64 2
  %9 = load i64, ptr %8, align 8, !tbaa !5
  br label %10

10:                                               ; preds = %.lr.ph, %10
  %indvars.iv = phi i64 [ 1, %.lr.ph ], [ %indvars.iv.next, %10 ]
  %11 = phi double [ 0.000000e+00, %.lr.ph ], [ %16, %10 ]
  %12 = add nsw i64 %indvars.iv, -1
  %13 = mul i64 %9, %12
  %14 = getelementptr i8, ptr %7, i64 %13
  ; Note: This has a similar structure to the LLVM IR generated from the following C code.
  %15 = load double, ptr %14, align 8, !tbaa !1
  %16 = fadd contract double %11, %15
  store double %16, ptr %3, align 8, !tbaa !1
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv, %6
  br i1 %exitcond.not, label %._crit_edge, label %10

._crit_edge:                                      ; preds = %10, %2
  call void @temp_(ptr nonnull %3)
  ret void
}
```

Compare it with the IR generated from the following C case using clang case.c -S -emit-llvm -O3:

```c
void temp(double *sum);

void test1(double * restrict A, double * restrict B, int N, int Stride) {
  int i;
  for (i = 0; i < N; ++i)
    B[i * Stride] += A[i * Stride];
}

void test2(double * restrict A, int N, int Stride) {
  int i;
  double sum = 0.0;
  for (i = 0; i < N; ++i)
    sum += A[i * Stride];
  temp(&sum);
}
```

The loop in function test1 can be vectorized using stride-check loop versioning. I haven't dug into why test2 fails to vectorize the reduction. Anyway, the loop vectorize pass can do some similar work, but I am not sure whether the SCEV analysis there can handle the Fortran case.

Comment Actions
Yes, I've only spent SOME time trying to understand why the vectorizer doesn't - it basically comes down to "can't figure out how the stride is calculated, and whether it may change over time". The loop versioning here helps to solve that for the lower layers in the compiler - and in my experience (I've written quite a few different styles of loops and such), either some MLIR pass(es) manages to vectorize the code, or the LLVM loop vectorizer can't do it either. [Why are there two layers of vectorizers? I don't know - presumably people like to write code that does similar things on multiple levels.] Also, the overall goal isn't ONLY to vectorize, but also to "improve overall loop performance". It may be that we can convince the compiler to do a better job in some other way, but right now, I'm not aware of what that would be. The fundamental problem is that the descriptors used to pass array information get in the way of lots of things. It should be noted that although we see a performance improvement in roms_r, it does NOT come from vectorization of the loop constructs - both gfortran and flang use loop versioning and get a similar performance improvement - and neither compiler uses vector instructions in the hot loops in ROMS.
Also, if you do something like:

```c++
template <typename T> struct descr {
  size_t stride;
  T *ptr;
};

void temp(double *sum);

void sum1d(descr<double> *__restrict a, int n) {
  int i;
  double sum = 0.0;
  for (i = 0; i < n; ++i)
    sum += a->ptr[i * a->stride];
  temp(&sum);
}
```

the compiler (clang and gcc both) will generate "different" code than for plainly passing a pointer + stride. For simple cases, where all the code is "visible", the compiler can inline and propagate stuff; but if it's compiled separately, it "doesn't know what's going on", and gives up.

Comment Actions
LGTM, but someone else should approve. Only comment is that the fir test is currently called loop-versioing.fir. Should probably be loop-versioning.fir?

Comment Actions
OK.
I am confused about how the performance gets improved for roms_r, which you mentioned. From the test in this patch, I can only see the optimization opportunity of loop vectorization after loop versioning. Maybe another test case, extracted from roms_r, showing how it gets the performance improvement would be better. BTW, can you attach the link to the Discourse discussion in the commit message?

Comment Actions
FYI, another related scenario is https://github.com/llvm/llvm-project/issues/59388. I also faced one case with a loop interchange opportunity, which is blocked by the Fortran array stride information not being understood by SCEV analysis. I think this is one common issue in loop optimization. So there may need to be some discussion about whether loop versioning in an MLIR pass is the right solution.

Comment Actions
I compiled the code in the ticket with loop-versioning, which makes this code for the "version where stride matches":

```asm
.LBB0_17:
        movslq  %r13d, %r13
        leaq    (%rdx,%r13), %r12
        decq    %r12
        movups  (%r9,%r12,4), %xmm0
        leaq    -1(%r13,%r10), %r12
        movups  (%r14,%r12,4), %xmm1
        leaq    -1(%r13,%r11), %r12
        movups  (%r15,%r12,4), %xmm2
        mulps   %xmm1, %xmm2
        addps   %xmm0, %xmm2
        leaq    (%rbx,%r13), %r12
        decq    %r12
        movups  %xmm2, (%r8,%r12,4)
        addl    $4, %r13d
        addq    $-4, %rsi
        jne     .LBB0_17
```

So, that's using vector instructions. With loop versioning turned off:

```asm
.LBB0_4:
        movslq  %r11d, %r11
        leaq    -1(%r11), %rax
        movq    %r15, %rdx
        imulq   %rax, %rdx
        addq    %r10, %rdx
        movss   (%rbp,%rdx), %xmm0
        movq    %rsi, %rdx
        imulq   %rax, %rdx
        addq    %rdi, %rdx
        mulss   (%rcx,%rdx), %xmm0
        movq    -16(%rsp), %rdx
        imulq   %rax, %rdx
        addq    %r13, %rdx
        movq    -8(%rsp), %r12
        addss   (%r12,%rdx), %xmm0
        imulq   %r8, %rax
        addq    %r14, %rax
        movss   %xmm0, (%r9,%rax)
        incl    %r11d
        decq    %rbx
        cmpq    $1, %rbx
        ja      .LBB0_4
```

This is also a good example of how the code is simplified for the array accesses - there are no complicated index calculations in the loop. [This is partly because the array is flattened into a 1D array, and partly because we give the compiler a clearer idea of what is going on.] A basic "count the number of instructions" shows that the "versioned" loop has 16 instructions, vs 22 in the non-versioned loop. I haven't counted clock cycles per instruction, but just eyeballing it, I'd say it's even more of a difference - and of course, since it was successfully vectorized, it's ALSO faster because it runs a quarter as many iterations.

Comment Actions
Because the access to a 1D !fir.ref<!fir.array<*>> is simpler than the access to a !fir.box<!fir.array<*>>. See the example in the comment with a link to the "related ticket".
Will do.

Comment Actions
Thanks @MatsPetersson for working on this. I have started going through the patch and have some comments.
Comment Actions
I think it needs rebasing. I will do that today, but I'm not sure whether I can fix all the comments from Kiran before I go on leave for two weeks. Most are trivial, but they still take time (and I've got a bug that I've been working on for the past week that is higher priority).

Comment Actions
I have updated to the latest llvm/main branch, so it should apply now. Note that I have intentionally NOT fixed any of the review comments; I didn't want to fix a small number and leave the rest unresolved. I will fix them when I'm back from my break.
Comment Actions
The loop versioning is inefficient for the following scenario, in which not all arrays are contiguous in the loop (see the C sketch after this exchange). Maybe there should be a cost model for performing the loop versioning, refined and driven by the cases in real workloads?

```fortran
subroutine vadd(c, a, b, n)
  implicit none
  real :: c(:), a(:), b(:)
  integer :: i, n
  print *, is_contiguous(b) ! classic-flang is wrong here
  do i = 1, n
    c(i) = 5. * a(i) + 3. * b(i)
    c(i) = c(i) * a(i) + a(i) * 4.
  end do
end

interface
  subroutine vadd(c, a, b, n)
    implicit none
    integer :: n
    real :: c(:), a(:), b(:)
  end
end interface
integer, parameter :: ub = 10
real :: a(ub) = (/ (i, i=1, 10) /)
real :: b(ub*2) = (/ (i, i=1, 20) /)
real :: c(ub) = 0.
call vadd(c(1:ub), a(1:ub), b(1:ub:2), 10)
print *, c
end
```

BTW, classic-flang uses copy-in/copy-out in the caller for assumed-shape arrays, so all the strides are removed in the callee.

Comment Actions
Thanks @peixin for the comment. Mats is away and is only back in April.
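The C sketch referenced above: a hypothetical rendering of the versioning guard for vadd, assuming the fast path requires every versioned array to be unit stride (names, layout, and the single assignment are simplifications, not the pass's actual output):

```c
#include <stddef.h>

struct desc { float *base; ptrdiff_t byte_stride; };

void vadd_versioned(struct desc *c, struct desc *a, struct desc *b, int n) {
    if (a->byte_stride == (ptrdiff_t)sizeof(float) &&
        b->byte_stride == (ptrdiff_t)sizeof(float) &&
        c->byte_stride == (ptrdiff_t)sizeof(float)) {
        /* Fast path: only reachable when ALL arrays are contiguous. */
        for (int i = 0; i < n; ++i)
            c->base[i] = 5.f * a->base[i] + 3.f * b->base[i];
    } else {
        /* A call with b(1:ub:2) lands here, even though a and c are
           contiguous - none of the accesses get simplified. */
        for (int i = 0; i < n; ++i) {
            float *ci = (float *)((char *)c->base + i * c->byte_stride);
            float *ai = (float *)((char *)a->base + i * a->byte_stride);
            float *bi = (float *)((char *)b->base + i * b->byte_stride);
            *ci = 5.f * *ai + 3.f * *bi;
        }
    }
}
```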
OK. We can continue the discussion when Mats is back.
Comment Actions
So, I will have a closer look at this, but IF the call was made with a contiguous array, it would benefit a fair amount, since it vectorizes the loop - I will try to do a "with and without" benchmark comparison.
Comment Actions
Sorry, this should be treated as a special case and can be future work. This work is a good start, and I am inspired by it. Thanks.
Comment Actions
<snip code sample>
I combined this with my existing benchmark for 1D loop versioning (here: https://discourse.llvm.org/t/rfc-loop-versioning-for-unit-stride/68605/2), so that we run vadd 250000 times in a loop, with a 4000-element array size, with b as unit stride (contiguous) and b as 2 * unit stride. With loop versioning, the result when the array is unit stride is 0.33s; with non-unit stride it is 1.2s [for my x86-64 desktop machine]. Without loop versioning, all three variants take about 1.2s. That makes the vectorized (thanks to loop versioning) version 73% faster (or 3.6x faster).

I also ran the same benchmark with a size of 10 instead of 4000 (and correspondingly more loops, to keep the total runtime comparable). The result is STILL better for the versioned loop, but by much less than for the large array (for many reasons: loop overhead, not being able to use vector instructions for all elements, etc.). With loop versioning: 0.76s for unit stride, 1.2s for non-unit stride. Without loop versioning, the result is 1.2s for both scenarios (well, all three cases - but two of those are identical, just passing b and c into the function).

For the benchmarks I did, I couldn't get a measurable slow-down for the versioned loop in the non-unit-stride variant - it may be measurable for a different case. [The call does get inlined in this benchmark, something I actively tried to avoid in my original benchmarks.]

Here's the initial, 4000-element version of the code:

```fortran
subroutine vadd(c, a, b, n)
  implicit none
  real :: c(:), a(:), b(:)
  integer :: i, n
  do i = 1, n
    c(i) = 5. * a(i) + 3. * b(i)
    c(i) = c(i) * a(i) + a(i) * 4.
  end do
end subroutine vadd

subroutine do_bench(msg, cc, aa, bb, size)
  interface
    subroutine vadd(c, a, b, n)
      implicit none
      integer :: n
      real :: c(:), a(:), b(:)
    end subroutine vadd
  end interface
  character(*) :: msg
  integer, parameter :: loops = 250000
  integer :: size
  real :: aa(1:)
  real :: bb(1:)
  real :: cc(1:)
  real*8 :: time, time_start, time_end
  integer :: i
  call CPU_TIME(time_start)
  do i = 1, loops
    call vadd(cc, aa, bb, size)
  end do
  call CPU_TIME(time_end)
  time = time_end - time_start
  print "(A12, F8.5, A2)", msg, time, " s"
end subroutine do_bench

interface
  subroutine vadd(c, a, b, n)
    implicit none
    integer :: n
    real :: c(:), a(:), b(:)
  end subroutine vadd
  subroutine do_bench(msg, cc, aa, bb, size)
    character(*) :: msg
    integer :: size
    real :: aa(1:)
    real :: bb(1:)
    real :: cc(1:)
  end
end interface
integer, parameter :: ub = 10
real :: a(ub) = (/ (i, i=1, 10) /)
real :: b(ub*2) = (/ (i, i=1, 20) /)
real :: c(ub) = 0.
integer, parameter :: size = 4000
real :: aa(size)
real :: bb(size)
real :: cc(size * 2)
real :: res(size)
call vadd(c(1:ub), a(1:ub), b(1:ub:2), 10)
print *, c
aa = 1
bb = 2
cc = 3
res = 0
call do_bench("a + b", res, aa, bb, size)
call do_bench("a + c", res, aa, cc, size)
call do_bench("a + c(::2)", res, aa, cc(::2), size)
end
```

For the 10-element version, just replace loops = 250000 with loops = 250000 * 400, and size = 4000 with size = 10. The number of loops is chosen to give a "reasonable amount of time on my machine" - I try to aim for benchmarks of this kind that run around 0.5-1.5s, and I just used the values from my original benchmark to start with.

Yes, the vadd code is definitely quite a bit longer - about 3.4 times (428 vs 126 bytes) for my x86-64 code - this will clearly vary depending on the processor architecture, so Arm may get a different result.
Much of that is due to the fact that the loop also gets vectorized, which expands the original code a fair bit, because THAT also has to include the code to handle the last few elements if the size isn't a multiple of the vector size. Being able to vectorize loops is definitely one of the goals here.

As with pretty much any optimisation, not everything will be positive - in this case, larger code size, and possibly also some cases of "marginally slower". I've not found any case where it is actually slower, but I'm sure there are cases where this happens. [A couple of years ago, I was working on a hobby compiler project where I called a runtime function to count the number of bits "set" in an array of i32 elements, and the compiler decided that "oh, you probably do this on large arrays, so I'll vectorize the loop" - for the application I was working on, the size of the array was always 1 - so it was badly punished by the "check if we can use vector instructions" code. I can't remember if I tried to fix the function, or went straight to "form the correct code inline instead" - that project has an inline count-bits now, either way.]

(I have seen the reply to my previous comment, but since I'd almost written all this already, I completed it.)
Comment Actions Thanks for the update. The whole design looks good to me now. Thanks for the work.
Comment Actions
Comment Actions LGTM. Thanks for working on this and addressing all the comments. I am requesting further tests, Nits and also some additional checks to avoid failures when loops are generated.
Comment Actions Review comment changes:
|