Differential D141306
Add loop-versioning pass to improve unit-stride
Authored by Leporacanthicus on Jan 9 2023, 10:25 AM.
Details
Introduce conditional code to identify stride of "one element", and simplify the array accesses for that case. This allows better loop performance in various benchmarks.
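In source terms, the versioning is conceptually similar to the hand-written Fortran sketch below. This is illustrative only: the pass itself operates on FIR and generates the stride check and simplified address computations directly; the target attribute and contiguous pointer here are just a source-level stand-in for that.

! Illustrative sketch only - not what the pass emits; the pass works on FIR.
subroutine scale(a, n)
  implicit none
  real, target :: a(:)              ! target attribute added only for this sketch
  real, pointer, contiguous :: p(:)
  integer :: i, n
  if (is_contiguous(a)) then
    ! Fast version: a contiguous view means the accesses can be lowered as
    ! simple unit-stride loads/stores, which also enables vectorization.
    p => a
    do i = 1, n
      p(i) = p(i) * 2.
    end do
  else
    ! Fallback: the original loop with descriptor-based (possibly strided) accesses.
    do i = 1, n
      a(i) = a(i) * 2.
    end do
  end if
end subroutine scale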
Event Timeline
Comment Actions I think it needs rebasing. I will do that today, but I'm not sure whether I can fix all the comments from Kiran before I go on leave for two weeks. Most are trivial, but they still take time (and I've got a bug that I've been working on for the past week that is higher priority).
Comment Actions I have updated to the latest llvm/main branch, so it should apply now. Note that I have intentionally NOT fixed any of the review comments; I didn't want to fix a small number and leave the rest unresolved. I will fix them when I'm back from my break.
Comment Actions The loop versioning is inefficient for the following scenario, in which not all of the arrays in a loop are contiguous. Maybe there should be a cost model to decide when to perform the loop versioning, which can then be refined based on cases from real workloads?

subroutine vadd(c, a, b, n)
  implicit none
  real :: c(:), a(:), b(:)
  integer :: i, n
  print *, is_contiguous(b) ! classic-flang is wrong here
  do i = 1, n
    c(i) = 5. * a(i) + 3. * b(i)
    c(i) = c(i) * a(i) + a(i) * 4.
  end do
end

interface
  subroutine vadd(c, a, b, n)
    implicit none
    integer :: n
    real :: c(:), a(:), b(:)
  end
end interface
integer, parameter :: ub = 10
real :: a(ub) = (/ (i, i=1, 10) /)
real :: b(ub*2) = (/ (i, i=1, 20) /)
real :: c(ub) = 0.
call vadd(c(1:ub), a(1:ub), b(1:ub:2), 10)
print *, c
end

BTW, classic-flang uses copy-in/copy-out in the caller for assumed-shape arrays, and all the strides are removed in the callee.

Comment Actions Thanks @peixin for the comment. Mats is away and is only back in April.

Comment Actions
OK. We can continue the discussion when Mats is back.
Comment Actions So, I will have a closer look at this, but IF the call was made with a contiguous array, it would benefit a fair amount, since it vectorizes the loop - I will try to do a "with and without" benchmark comparison.
Comment Actions
Sorry. This case should be treated as a special case, and can be future work. This work is a good start, and I am inspired by it. Thanks.
Comment Actions <snip code sample>
I combined this with my existing benchmark for 1D loop versioning (here: https://discourse.llvm.org/t/rfc-loop-versioning-for-unit-stride/68605/2), so that we run vadd 250000 times in a loop, with a 4000-element array, with b as unit stride (contiguous) and as 2 * unit stride. With loop versioning, the result when the array is unit-stride is 0.33s, and with non-unit stride it is 1.2s [on my x86-64 desktop machine]. Without loop versioning, all three variants take about 1.2s. That makes the vectorized (thanks to loop versioning) version about 3.6x faster (a roughly 73% reduction in runtime).

I also ran the same benchmark with a size of 10 instead of 4000 (and correspondingly more loop iterations). The result is STILL better for the versioned loop, but the gain is much smaller than with the large array (for many reasons: loop overhead, not being able to use vector instructions for all elements, etc.). With loop versioning: 0.76s for unit stride, 1.2s for non-unit stride. Without loop versioning, the result is 1.2s for both scenarios (well, all three cases - but two of those are identical, just passing b and c into the function).

For the benchmarks I did, I couldn't get a measurable slow-down for the versioned loop in the non-unit-stride variant - it may be measurable for a different case. [The call does get inlined in this benchmark, something I actively tried to avoid in my original benchmarks.]

Here's the initial, 4000-element version of the code:

subroutine vadd(c, a, b, n)
  implicit none
  real :: c(:), a(:), b(:)
  integer :: i, n
  do i = 1, n
    c(i) = 5. * a(i) + 3. * b(i)
    c(i) = c(i) * a(i) + a(i) * 4.
  end do
end subroutine vadd

subroutine do_bench(msg, cc, aa, bb, size)
  interface
    subroutine vadd(c, a, b, n)
      implicit none
      integer :: n
      real :: c(:), a(:), b(:)
    end subroutine vadd
  end interface
  character(*) :: msg
  integer, parameter :: loops = 250000
  integer :: size
  real :: aa(1:)
  real :: bb(1:)
  real :: cc(1:)
  real*8 :: time, time_start, time_end
  integer :: i
  call CPU_TIME(time_start)
  do i = 1, loops
    call vadd(cc, aa, bb, size)
  end do
  call CPU_TIME(time_end)
  time = time_end - time_start
  print "(A12, F8.5, A2)", msg, time, " s"
end subroutine do_bench

interface
  subroutine vadd(c, a, b, n)
    implicit none
    integer :: n
    real :: c(:), a(:), b(:)
  end subroutine vadd
  subroutine do_bench(msg, cc, aa, bb, size)
    character(*) :: msg
    integer :: size
    real :: aa(1:)
    real :: bb(1:)
    real :: cc(1:)
  end
end interface
integer, parameter :: ub = 10
real :: a(ub) = (/ (i, i=1, 10) /)
real :: b(ub*2) = (/ (i, i=1, 20) /)
real :: c(ub) = 0.
integer, parameter :: size = 4000
real :: aa(size)
real :: bb(size)
real :: cc(size * 2)
real :: res(size)
call vadd(c(1:ub), a(1:ub), b(1:ub:2), 10)
print *, c
aa = 1
bb = 2
cc = 3
res = 0
call do_bench("a + b", res, aa, bb, size)
call do_bench("a + c", res, aa, cc, size)
call do_bench("a + c(::2)", res, aa, cc(::2), size)
end

For the 10-element version, just replace loops = 250000 with loops = 250000 * 400, and size = 4000 with size = 10. The number of loops is chosen to take a "reasonable amount of time on my machine" - I aim for benchmarks of this kind to run for around 0.5-1.5s, and I just used the values from my original benchmark to start with. Yes, the vadd code is definitely quite a bit longer - about 3.4 times (428 vs 126 bytes) for my x86-64 code - this will clearly vary depending on the processor architecture, so Arm may get a different result.
Much of that is due to the fact that the loop also gets vectorized, which expands the original code a fair bit, because THAT also has to include the code to handle the last few elements if the size isn't a multiple of the vector width (see the sketch after this comment). Being able to vectorize loops is definitely one of the goals here.

As with pretty much any optimisation, not everything will be positive - in this case, larger code size, and possibly some cases of "marginally slower". I've not found any case where it is actually slower, but I'm sure such cases exist. [A couple of years ago, I was working on a hobby compiler project, where I called a runtime function to count the number of bits set in an array of i32 elements. The compiler decided "oh, you probably do this on large arrays, so I'll vectorize the loop" - but for the application I was working on, the size of the array was always 1, so it was badly punished by the "check if we can use vector instructions" code. I can't remember whether I tried to fix the function or went straight to generating the correct code inline instead - either way, that project now counts bits inline.]

(I have seen the reply to my previous comment, but since I'd almost written all this already, I completed it)
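For illustration, here is a hand-written Fortran sketch (not actual compiler output) of the shape a vectorized loop typically takes: a main loop over whole groups of elements plus a scalar remainder loop for the tail, which is where much of the extra code size comes from.

! Hand-written sketch, not compiler output: a main loop over groups of 4
! elements plus a scalar remainder loop for the tail when n is not a
! multiple of the group size.
subroutine vadd_sketch(c, a, b, n)
  implicit none
  real :: c(:), a(:), b(:)
  integer :: i, n, nmain
  nmain = n - mod(n, 4)
  do i = 1, nmain, 4
    c(i:i+3) = 5. * a(i:i+3) + 3. * b(i:i+3)   ! whole groups map naturally to vector instructions
  end do
  do i = nmain + 1, n
    c(i) = 5. * a(i) + 3. * b(i)               ! scalar epilogue for the last few elements
  end do
end subroutine vadd_sketch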
Comment Actions Thanks for the update. The whole design looks good to me now. Thanks for the work.
Comment Actions
Comment Actions LGTM. Thanks for working on this and addressing all the comments. I am requesting further tests, nits, and also some additional checks to avoid failures when loops are generated.
Comment Actions Review comment changes: