This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Increase default vectorizer LMUL to 2
ClosedPublic

Authored by luke on Feb 10 2023, 3:32 AM.

Details

Summary

After some discussion and experimentation, we have seen that raising the default number of vector register bits used by the vectorizer to LMUL=2 strikes a sweet spot.
Whilst we could be clever here and make the vectorizer smarter about dynamically selecting an LMUL that
a) Doesn't affect register pressure
b) Is suitable for the microarchitecture
we would need to teach its heuristics about RISC-V register grouping specifics.
Instead, this just does the easy, pragmatic thing and changes the default to a safe value that doesn't significantly affect register pressure[1] but should increase throughput and unlock more interleaving.

[1] Register spilling when compiling sqlite at various levels of -riscv-v-register-bit-width-lmul:

LMUL=1 2573 spills
LMUL=2 2583 spills
LMUL=4 2819 spills
LMUL=8 3256 spills
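
For reference, the knob involved is the -riscv-v-register-bit-width-lmul value that feeds getRegisterBitWidth(). As a rough standalone model (not the actual LLVM source, and assuming the 64-bit RVV block size LLVM uses for scalable types), the known-minimum register width reported to the vectorizer, and therefore the VF it picks, scales linearly with this LMUL setting:

  #include <cstdio>
  #include <initializer_list>

  int main() {
    const unsigned RVVBitsPerBlock = 64; // bits represented by one vscale unit for RVV scalable types
    for (unsigned LMUL : {1u, 2u, 4u, 8u}) {
      unsigned RegBits = RVVBitsPerBlock * LMUL; // known-minimum register width reported to the vectorizer
      std::printf("LMUL=%u: widest VFs are <vscale x %u x i64> and <vscale x %u x i32>\n",
                  LMUL, RegBits / 64, RegBits / 32);
    }
    return 0;
  }

With the default moving from 1 to 2, an i64 loop goes from <vscale x 1 x i64> to <vscale x 2 x i64> elements per vector operation.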

Diff Detail

Event Timeline

luke created this revision. Feb 10 2023, 3:32 AM
Herald added a project: Restricted Project. Feb 10 2023, 3:32 AM
luke requested review of this revision. Feb 10 2023, 3:32 AM
craig.topper added inline comments. Feb 10 2023, 10:16 PM
llvm/test/Transforms/LoopVectorize/RISCV/lmul.ll
17

Why are we generating 2 loads, 2 adds, and 2 stores now? I thought this would only change the types, not the number of instructions generated.

I chatted w/Luke about this offline before he posted the patch, but let me lay out the major concern and mitigation strategies here.

Moving from LMUL1 to LMUL2 increases the VF used during vectorization. Since we don't currently tail fold by default, this means both that a) there are more loops too short to benefit from vectorization, and b) that the tail is (on average) longer. Both of these could potentially negatively impact performance. The former could result in a higher fraction of wasted code size.
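
A rough standalone illustration of (a) and (b), not LLVM code and with made-up trip counts and VF values: without tail folding, a loop of N iterations runs N / VF vector iterations and leaves N % VF iterations to the scalar epilogue, so doubling VF both raises the minimum trip count at which the vector body runs at all and lengthens the average tail.

  #include <cstdio>
  #include <initializer_list>

  int main() {
    for (unsigned N : {7u, 13u}) {     // example trip counts
      for (unsigned VF : {4u, 8u}) {   // e.g. a VF that doubles when moving from LMUL=1 to LMUL=2
        std::printf("N=%u, VF=%u: %u iterations in the vector body, %u left to the scalar tail\n",
                    N, VF, (N / VF) * VF, N % VF);
      }
    }
    return 0;
  }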

The major mitigation options are:

  1. Make the vectorizer smarter about taking predicted loop length (from profiling) into consideration, and dynamically select an LMUL to minimize the chances of (a). We could also make the vectorizer take known trip count modulos into account when selecting VF. Both of these are tricky with scalable types, and we'd probably have to resort to heuristics using vscale-for-tuning.
  2. Enable tail folding using masking (see the sketch below this list). This support exists in tree today, and with some quick testing appears to be reasonably robust. The downside is that masking has a non-zero execution cost. The major effect of this is to eliminate concern (b), though we're left with a profitability concern around (a). The bypass heuristic here would probably need some careful tuning. I don't think we need to enable tail folding *before* moving to LMUL2, but if we see regressions, this would probably be the first knob to try.
  3. Pursue tail folding via VL (i.e. VP-intrinsic based). We've talked about this before, but it's a lot of work. I'm hoping we don't need to have gotten all the way to VL-based predication to enable LMUL2, but well, we'll see.
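
For reference, a minimal conceptual sketch of option 2 (tail folding by masking), not LLVM's implementation and with hypothetical names: each vector iteration runs under a lane mask, so the final partial iteration is handled by switching off the out-of-range lanes rather than by a scalar epilogue.

  #include <cstddef>

  // Each outer iteration models one predicated vector iteration of width `vf`;
  // the inner loop stands in for a single masked vector load/add/store.
  void axpy_tail_folded(float *x, const float *y, std::size_t n, std::size_t vf) {
    for (std::size_t i = 0; i < n; i += vf) {
      for (std::size_t lane = 0; lane < vf; ++lane) {
        bool active = (i + lane) < n; // lane mask: false for out-of-range lanes in the last iteration
        if (active)
          x[i + lane] += y[i + lane];
      }
    }
  }

The non-zero execution cost mentioned above corresponds to generating and applying that mask on every iteration, not just the last one.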

Overall, I think it's reasonable to move forward with the change to LMUL2 (once the unrolling issue is fixed), but we need to be clear this is somewhat speculative and there's a very decent chance we may have to adjust course.

llvm/test/Transforms/LoopVectorize/RISCV/lmul.ll
17

This is the same issue I noticed in the test change. There appears to be an unexpected interaction between lmul and unrolling going on here.

We could change getMaxInterleaveFactor() to return 1 instead of 2. We probably should do that since LMUL is like hardware interleaving.

I would still like to know why increasing LMUL also increased interleaving.

loralb added a subscriber: loralb. Feb 16 2023, 9:01 AM

> We could change getMaxInterleaveFactor() to return 1 instead of 2. We probably should do that since LMUL is like hardware interleaving.
>
> I would still like to know why increasing LMUL also increased interleaving.

Hi everyone!
About @craig.topper's question: at BSC we already had the "pleasure" of stumbling into this. As far as my understanding goes, this happens because of the getMaxInterleaveFactor() function in llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h. Basically, a VF value of 1 is interpreted as "loop not vectorized" (since, for example, [1 x i64] is like a scalar i64), hence disabling interleaving. With a default LMUL value of 2 though, the VF == 1 check fails, meaning the actual maximum interleave factor value is used.

P.S. As you may have noticed, the check in getMaxInterleaveFactor() completely ignores the existence of scalable vectors.
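
For anyone reading along, here is a paraphrased standalone model (not the actual LLVM source) of what that check amounts to:

  #include <cstdio>

  // A VF whose known-minimum element count is 1 is treated as "loop not
  // vectorized", which disables interleaving; larger VFs get the subtarget's
  // maximum interleave factor (assumed to be 2 here).
  unsigned getMaxInterleaveFactorModel(unsigned VFKnownMin, unsigned SubtargetMax = 2) {
    return VFKnownMin == 1 ? 1 : SubtargetMax;
  }

  int main() {
    // <vscale x 1 x i64> (LMUL=1) reports a known-minimum VF of 1 -> no interleaving.
    std::printf("LMUL=1 i64 loop: max interleave = %u\n", getMaxInterleaveFactorModel(1));
    // <vscale x 2 x i64> (LMUL=2) reports a known-minimum VF of 2 -> interleaving allowed.
    std::printf("LMUL=2 i64 loop: max interleave = %u\n", getMaxInterleaveFactorModel(2));
    return 0;
  }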

>> We could change getMaxInterleaveFactor() to return 1 instead of 2. We probably should do that since LMUL is like hardware interleaving.
>>
>> I would still like to know why increasing LMUL also increased interleaving.
>
> Hi everyone!
> About @craig.topper's question: at BSC we already had the "pleasure" of stumbling into this. As far as my understanding goes, this happens because of the getMaxInterleaveFactor() function in llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h. Basically, a VF value of 1 is interpreted as "loop not vectorized" (since, for example, [1 x i64] is like a scalar i64), hence disabling interleaving. With a default LMUL value of 2 though, the VF == 1 check fails, meaning the actual maximum interleave factor value is used.
>
> P.S. As you may have noticed, the check in getMaxInterleaveFactor() completely ignores the existence of scalable vectors.

Thanks @loralb, that makes perfect sense.

eopXD added a comment (edited). Mar 9 2023, 1:31 AM

@loralb Just a heads-up, I have a patch that will resolve the problem of disabling the vectorizer when the interleave factor is 1. I haven't updated it in a while and hope I can land it before the discussion here converges. https://reviews.llvm.org/D134745

reames added a comment. Mar 9 2023, 8:17 AM

> @loralb Just a heads-up, I have a patch that will resolve the problem of disabling the vectorizer when the interleave factor is 1. I haven't updated it in a while and hope I can land it before the discussion here converges. https://reviews.llvm.org/D134745

@eopXD The issue discussed here was resolved by https://reviews.llvm.org/D144474 (which has already landed). I think your patch addresses an unrelated issue.

eopXD added a comment. Mar 9 2023, 6:08 PM

>> @loralb Just a heads-up, I have a patch that will resolve the problem of disabling the vectorizer when the interleave factor is 1. I haven't updated it in a while and hope I can land it before the discussion here converges. https://reviews.llvm.org/D134745
>
> @eopXD The issue discussed here was resolved by https://reviews.llvm.org/D144474 (which has already landed). I think your patch addresses an unrelated issue.

Yes, you are correct; I mistook the comment above. Please disregard my heads-up.

craig.topper added inline comments. Mar 9 2023, 8:17 PM
llvm/test/Transforms/LoopVectorize/RISCV/lmul.ll
1–2

If we put this RUN line last, will it prevent the script from reordering the rest of the file?

luke updated this revision to Diff 504587. Mar 13 2023, 4:09 AM

Reshuffle RUN lines to make the diff cleaner

luke added inline comments. Mar 13 2023, 4:10 AM
llvm/test/Transforms/LoopVectorize/RISCV/lmul.ll
1–2

Good idea, it does indeed

luke marked 3 inline comments as done. Mar 13 2023, 4:10 AM
This revision is now accepted and ready to land. Mar 13 2023, 2:22 PM
luke updated this revision to Diff 504899. Mar 13 2023, 5:29 PM

Update tests

luke updated this revision to Diff 504904. Mar 13 2023, 6:06 PM

Undo unintentional steamrolling of handwritten tests with update_test_checks

luke added a comment. Mar 15 2023, 2:58 AM

@craig.topper Gentle ping: I've updated the tests and a lot of them have changed; not sure how I didn't include these updates beforehand.

luke added inline comments. Mar 15 2023, 3:02 AM
llvm/test/Transforms/LoopVectorize/RISCV/short-trip-count.ll
75

Doesn't use a masked load anymore

llvm/test/Transforms/LoopVectorize/RISCV/zvl32b.ll
23

Less interleaving here

luke updated this revision to Diff 506921. Mar 21 2023, 5:00 AM

Update SLP test case

SLP also calls this hook to work out the maximum vector register size.
With LMUL=2, the test case now has 128*2 = 256 bits to work with per "register", so it can now vectorize those 4 i64 stores.
Note that this doesn't actually change the generated code; both -riscv-v-register-bit-width-lmul=1 and -riscv-v-register-bit-width-lmul=2 produce the following:

  foo:
          vsetivli        zero, 4, e64, m2, ta, ma
          vmv.v.i v8, 0
          vse64.v v8, (a0)
          ret
This revision was landed with ongoing or failed builds. Mar 23 2023, 3:33 AM
This revision was automatically updated to reflect the committed changes.