This is an archive of the discontinued LLVM Phabricator instance.

[LV] Cost model for out-of-loop reductions
AcceptedPublic

Authored by anna on Jun 29 2023, 2:47 PM.

Details

Summary

When we have small trip count loops, the cost of out of loop reduction
becomes significant. We do not consider the cost of out of loop
reductions in loop vectorizer (in-loop vectorizations are handled in
cost modelling).

This patch extends the logic used by cost modelling of runtime checks to figure out the minimum trip count under which runtime checks are profitable. We reuse the same idea for out of loop reductions.

Diff Detail

Event Timeline

anna created this revision.Jun 29 2023, 2:47 PM
anna requested review of this revision.Jun 29 2023, 2:47 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 29 2023, 2:47 PM

Sounds OK to me. Did you consider including the cost of reductions in the existing vectorizer cost calculations, as opposed to a hard limit?

fhahn added a comment.Jul 3 2023, 5:37 AM

From the description, it is not clear why FMinimum/FMaximum reductions in particular should be a blocker for vectorizing small trip count loops. What about other reductions?

Given that this is for short trip counts, it might be better/more general to extend the code that computes the minimum trip count needs to negate the overhead of runtime checks to also consider the overhead of computing the final reduction value (see https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L10047)

nikic added a subscriber: nikic.Jul 3 2023, 5:53 AM

From the description, it is not clear why FMinimum/FMaximum reductions in particular should be a blocker for vectorizing small trip count loops. What about other reductions?

Given that this is for short trip counts, it might be better/more general to extend the code that computes the minimum trip count needs to negate the overhead of runtime checks to also consider the overhead of computing the final reduction value (see https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L10047)

Maybe this generalized version would also fix https://github.com/llvm/llvm-project/issues/57476.

I support Florian's suggestion. We better take reduction overhead into account the same way as we did for runtime checks.
Today's cost model is very limited in its ability to predict the overhead which comes with the vectorization. For example, the overhead of the scalar epilogue loop is simply ignored. I think, one day we should generalize the cost model so that vectorization overhead is integral part of the model itself rather than a follow up "fix up".

anna planned changes to this revision.Jul 13 2023, 5:36 PM

thank you Florian, this is a nice idea. Working on it.

anna updated this revision to Diff 552416.Aug 22 2023, 10:05 AM

Rewrote the patch to extend existing logic for runtime checks calculation.
This now supports all kinds of out of loop reductions. Since we compute an upper-bound on MinProfitableTripCount, this maybe a more conservative estimate.

anna updated this revision to Diff 553476.Aug 25 2023, 7:52 AM

added an option which switches this off by default. The failures above are related to ARM and RISCV targets.

anna added a comment.Aug 30 2023, 7:46 AM

Gentle ping @fhahn. Anything I could do to move this forward? I have kept the option as off by default, since I would need to help to test this on supported targets upstream. It helps our use-case where loops are not vectorized when out-of-loop reductions are present in small trip count loops.
There maybe some fallouts where correction in reductions cost modelling will be required.

fhahn added a comment.Aug 30 2023, 7:52 AM

Gentle ping @fhahn. Anything I could do to move this forward? I have kept the option as off by default, since I would need to help to test this on supported targets upstream. It helps our use-case where loops are not vectorized when out-of-loop reductions are present in small trip count loops.
There maybe some fallouts where correction in reductions cost modelling will be required.

What are the regressions on ARM/RISCV? I think we should aim to enable this by default for all platforms, otherwise it is at the risk of not getting enabled. Also, those tests may show issues with the current implementation.

Do you have any numbers on the impact on X86?

anna updated this revision to Diff 555134.Aug 31 2023, 12:02 PM

changed the flag to on by default. Fixed the ARM/RISCV tests (by updating the lines). More details will follow.

anna added a comment.Aug 31 2023, 12:23 PM

Gentle ping @fhahn. Anything I could do to move this forward? I have kept the option as off by default, since I would need to help to test this on supported targets upstream. It helps our use-case where loops are not vectorized when out-of-loop reductions are present in small trip count loops.
There maybe some fallouts where correction in reductions cost modelling will be required.

What are the regressions on ARM/RISCV? I think we should aim to enable this by default for all platforms, otherwise it is at the risk of not getting enabled. Also, those tests may show issues with the current implementation.

The regressions on ARM and RISCV are because we have an updated minimum trip count we expect (to offset the cost of the out-of-loop reductions), so the CHECK lines have been updated accordingly. RISCV looks okay with the minimum trip count being at least 4 or 8 (we round the MinimumTripCount up to VF, so it is more conservative).

However, in the case of ARM, there is no special cost for min/max reductions. They compute a pretty expensive cost under the assumption that we expand these reductions and extract the value. On some simple tests I ran, the assembly generated supports this cost for the -mtriple=armv7-none-eabi, but the code generated is different for another triple such as -mtriple=thumbv8.1m.main-none-none-eabi. I've added more details under the respective test.

Do you have any numbers on the impact on X86?

I will have some numbers soon. The tests are still running. We see positive impact on small trip count loops on our workloads (in the range of 40% improvement) - these workloads are small trip count loops with fminimum/fmaximum reductions where we no longer vectorize.

llvm/test/Transforms/LoopVectorize/ARM/tail-fold-multiple-icmps.ll
12 ↗(On Diff #555134)

This minimum trip count of 56 comes about because we compute the reduction cost as 140 (and there are two such reductions, with total being 280).
This is a pretty high cost which is computed through https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/CodeGen/BasicTTIImpl.h#L2372.
This is correct for a triple with mattr such as:

llc < %s -mtriple=armv7-none-eabi -float-abi=hard -mattr=+neon -verify-machineinstrs | FileCheck %s

define i8 @test_umin_v8i8(<8 x i8> %x) {
; CHECK-LABEL: test_umin_v8i8:
; CHECK:       @ %bb.0: @ %entry
; CHECK-NEXT:    vpmin.u8 d16, d0, d0
; CHECK-NEXT:    vpmin.u8 d16, d16, d16
; CHECK-NEXT:    vpmin.u8 d16, d16, d16
; CHECK-NEXT:    vmov.u8 r0, d16[0]
; CHECK-NEXT:    bx lr
entry:
  %z = call i8 @llvm.vector.reduce.umin.v8i8(<8 x i8> %x) 
  ret i8 %z
}

However, I'm not sure if this is a correct cost for other triples.

I can look into the Arm costs. Some of them were deliberately high to try and stop bad vectorization from happening (or they haven't come up much in the past).

I've made some changes for the Arm MVE and NEON costs. Can you try a rebase? Thanks

anna added a comment.Sep 5 2023, 8:53 AM

I've made some changes for the Arm MVE and NEON costs. Can you try a rebase? Thanks

Thanks @dmgreen! The updated MinTripCount looks better now (4 instead of 56). I will update the patch.

anna added a comment.Sep 6 2023, 11:19 AM

Our X86 results are in. Over 244 workloads, the geomean changes by about ~0.4%. Looking at individual workloads, there are no major gains or regressions in large applications (unfortunately, we cannot share the exact benchmark names publicly). However, we do see big gains in 3 of these workloads where it performs a floating point minimum/maximum reduction over a float array of 3 elements (without this change we were vectorizing it).
Overall, I think the change is still reasonable to have since we now account for out of loop reductions more accurately.

Also, based on the RISCV and ARM results, we can see the minimum trip count is more reasonable numbers (4 or 8 for both targets). I will update the rebased patch.

anna updated this revision to Diff 556061.Sep 6 2023, 11:21 AM

rebased over changes from dmgreen with more accurate costs for ARM out-of-loop reductions.

anna edited the summary of this revision. (Show Details)Sep 6 2023, 11:29 AM
anna added a comment.Sep 19 2023, 6:20 AM

Any comments to progress this further?

nikic added a comment.Sep 19 2023, 7:00 AM

Would you mind testing whether this also fixes https://github.com/llvm/llvm-project/issues/57476 and adding it as a test case if so?

anna added a comment.Sep 20 2023, 7:04 AM

Would you mind testing whether this also fixes https://github.com/llvm/llvm-project/issues/57476 and adding it as a test case if so?

Unfortunately, it doesn't help when the trip count is exactly specified at compile time. However, if it is through a branch weight profile (having same behaviour as trip-count=2), with this patch we no longer vectorize ctpop. The reason is that when exact trip count =2, we clamp the max VF to be exactly 2, the reduction cost = 1 and we still vectorize. With branch profile stating trip count will be 2, we go through VF=4 and without the patch we end up vectorizing with masked vector instructions. With the patch, the vector reduce.add with VF=4 ends up being costlier for the small trip count loop and we avoid vectorizing. I will add both test cases to show the difference.

anna updated this revision to Diff 557112.Sep 20 2023, 7:07 AM

Added ctpop test

I ran some tests and they looked OK. Sometimes these RuntimeChecks costs need to be quite precise - more precise than they have needed to be for vectorization in the past, and it can end up with cases where you spend some of the overhead costs of vectorization without getting the benefits, but the results looks good from what I can tell.

I might regret mentioning this if it increases the costs, but should interleaved loops get an extra cost to account for reducing the vectors by UF, into a single vector that is then vec.reduced?

anna added a comment.Sep 28 2023, 6:35 AM

Thanks for testing the patch @dmgreen!

I might regret mentioning this if it increases the costs, but should interleaved loops get an extra cost to account for reducing the vectors by UF, into a single vector that is then vec.reduced?

That's a good point, I think there is a higher chance of "touching a lot more cases" on vectorization :) but it seems that can be done as a later patch, if needed?

dmgreen accepted this revision.Sep 29 2023, 3:20 AM

Yeah that sounds OK to me. I think the extra cost should be fairly minor.

This LGTM if no-one else has any other comments.

This revision is now accepted and ready to land.Sep 29 2023, 3:20 AM
nikic added a comment.Fri, Dec 1, 1:14 AM

Any particular reason this hasn't been landed yet?