Details

Reviewers

reames
craig.topper
asb

Commits

rG8d16c6809a08: [RISCV] Increase default vectorizer LMUL to 2

Summary

After some discussion and experimentation, we have seen that changing the default number of vector register bits to LMUL=2 strikes a sweet spot.
Whilst we could be clever here and make the vectorizer smarter about dynamically selecting an LMUL that
a) Doesn't affect register pressure
b) Is suitable for the microarchitecture
we would need to teach its heuristics about RISC-V register grouping specifics.
Instead, this just does the easy, pragmatic thing by changing the default to a safe value that doesn't affect register pressure significantly[1], but should increase throughput and unlock more interleaving.

[1] Register spilling when compiling sqlite at various levels of -riscv-v-register-bit-width-lmul:

LMUL=1 2573 spills
LMUL=2 2583 spills
LMUL=4 2819 spills
LMUL=8 3256 spills
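
To put the spill counts above in relative terms, here is a minimal sketch that computes the percentage increase over the LMUL=1 baseline. The helper name `percentIncrease` is made up for illustration; only the four counts come from the table.

```cpp
#include <cassert>

// Hypothetical helper (not part of the patch): percentage increase of
// `value` over `baseline`.
inline double percentIncrease(unsigned baseline, unsigned value) {
    assert(baseline != 0 && "baseline must be non-zero");
    return 100.0 * (static_cast<double>(value) - baseline) / baseline;
}
```

With the numbers from the table, `percentIncrease(2573, 2583)` is about 0.4, `percentIncrease(2573, 2819)` about 9.6, and `percentIncrease(2573, 3256)` about 26.5, which is consistent with the claim that LMUL=2 leaves register pressure nearly unchanged while LMUL=4 and LMUL=8 do not.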

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

luke created this revision. Feb 10 2023, 3:32 AM

Herald added a project: Restricted Project. Feb 10 2023, 3:32 AM

Herald added subscribers: pmatos, VincentWu, vkmr and 27 others.

luke requested review of this revision. Feb 10 2023, 3:32 AM

Herald added a project: Restricted Project. Feb 10 2023, 3:32 AM

Herald added subscribers: llvm-commits, pcwang-thead, eopXD, MaskRay.

luke added a parent revision: D143722: [RISCV][NFC] Add test for different LMULs in vectorizer. Feb 10 2023, 3:32 AM

luke edited the summary of this revision.

Harbormaster completed remote builds in B213004: Diff 496404. Feb 10 2023, 4:36 AM

craig.topper added inline comments. Feb 10 2023, 10:16 PM

llvm/test/Transforms/LoopVectorize/RISCV/lmul.ll
32	Why are we generating 2 loads, 2 adds, and 2 stores now? I thought this should only change the types, not the number of instructions generated

I chatted w/Luke about this offline before he posted the patch, but let me lay out the major concern and mitigation strategies here.

Moving from LMUL1 to LMUL2 increases the VF used during vectorization. Since we don't currently tail fold by default, this means both that a) there are more loops too short to benefit from vectorization, and b) that the tail is (on average) longer. Both of these could potentially negatively impact performance. The former could result in a higher fraction of wasted code size.
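
The effect on short loops and tail length can be sketched with a little arithmetic. The model below is an illustration only: the names `vectorizationFactor` and `splitLoop` are hypothetical, VF is taken to be VLEN * LMUL / SEW, and the loop is assumed to run a plain scalar remainder with no tail folding.

```cpp
#include <cassert>
#include <cstdint>

struct LoopSplit {
    std::uint64_t vectorIterations; // iterations of the vector body
    std::uint64_t scalarTail;       // elements left for the scalar remainder loop
};

// Elements handled per vector operation in this model: VLEN * LMUL / SEW.
inline std::uint64_t vectorizationFactor(std::uint64_t vlenBits,
                                         std::uint64_t lmul,
                                         std::uint64_t sewBits) {
    assert(sewBits != 0 && (vlenBits * lmul) % sewBits == 0);
    return vlenBits * lmul / sewBits;
}

// Splits a trip count into full vector iterations and a scalar tail.
// `interleave` is the interleave count (1 means no interleaving).
inline LoopSplit splitLoop(std::uint64_t tripCount, std::uint64_t vf,
                           std::uint64_t interleave = 1) {
    const std::uint64_t step = vf * interleave;
    assert(step != 0);
    return {tripCount / step, tripCount % step};
}
```

For example, assuming VLEN=128 and i64 elements, a loop with 3 iterations gets one vector iteration plus a tail of 1 at LMUL=1 (VF=2), but never enters the vector body at LMUL=2 (VF=4) and runs all 3 iterations in scalar code.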

The major mitigation options are:

Make the vectorizer smarter about taking into consideration predicted loop length (from profiling), and dynamically select an LMUL to minimize chances of (a). We could also make the vectorizer take known trip count modulos into account when selecting VF. Both of these are tricky with scalable types, and we'd probably have to resort to heuristics using vscalefortuning.

Enable tail folding using masking. This support exists in tree today, and with some quick testing appears to be reasonably robust. The downside is that masking has a non-zero execution cost. The major effect of this is to eliminate concern (b), though we're left with a profitability concern around (a). The bypass heuristic here would probably need some careful tuning. I don't think we need to enable tail folding *before* moving to LMUL2, but if we see regressions, this would probably be the first knob to try.

Pursue tail folding via VL (i.e. VP intrinsic based). We've talked about this before, but it's a lot of work. I'm hoping we don't need to have gotten all the way to VL based predication to enable LMUL2, but well, we'll see.
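
As a rough picture of what tail folding by masking means, the sketch below emulates a masked vector loop in scalar C++. It is an assumption-laden illustration, not the vectorizer's output: `addOneTailFolded` is a made-up name, and each inner-loop pass stands in for one lane of a masked vector operation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Adds 1 to every element, processing `vf` lanes per iteration. Lanes that
// fall past the end of the data are masked off, so no scalar remainder loop
// is needed. Returns the number of chunk iterations executed.
inline std::size_t addOneTailFolded(std::vector<std::int64_t> &data,
                                    std::size_t vf) {
    assert(vf != 0);
    std::size_t iterations = 0;
    for (std::size_t base = 0; base < data.size(); base += vf, ++iterations) {
        for (std::size_t lane = 0; lane < vf; ++lane) {
            const bool active = base + lane < data.size(); // per-lane mask bit
            if (active)
                data[base + lane] += 1;
        }
    }
    return iterations;
}
```

With 5 elements and `vf` = 4 this runs two iterations, the second with only one active lane. That removes the scalar tail (concern b), but every iteration pays for computing and applying the mask, which is the non-zero execution cost mentioned above.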

Overall, I think it's reasonable to move forward with the change to LMUL2 (once the unrolling issue is fixed), but we need to be clear this is somewhat speculative and there's a very decent chance we may have to adjust course.

llvm/test/Transforms/LoopVectorize/RISCV/lmul.ll
32	This is the same issue I noticed in the test change. There appears to be an unexpected interaction between lmul and unrolling going on here.

We could change getMaxInterleaveFactor() to return 1 instead of 2. We probably should do that since LMUL is like hardware interleaving.

I would still like to know why increasing LMUL also increased interleaving.

In D143723#4128065, @craig.topper wrote:

We could change getMaxInterleaveFactor() to return 1 instead of 2. We probably should do that since LMUL is like hardware interleaving.

I would still like to know why increasing LMUL also increased interleaving.

Hi everyone!
About @craig.topper's question, at BSC we already had the "pleasure" of stumbling into this: as far as my understanding goes, this happens because of the getMaxInterleaveFactor() function in llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h. Basically, a VF value of 1 is interpreted as "loop not vectorized" (since, for example, [1 x i64] is like a scalar i64), hence disabling interleaving. With a default LMUL value of 2 though, the VF == 1 check fails, meaning the actual MaxInterleavingFactor value is used.

P.S. As you may have noticed, the check in getMaxInterleaveFactor() completely ignores the existence of scalable vectors.
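
The behaviour described in this comment can be modelled in a few lines. This is a simplified sketch, not the LLVM source: `ElementCountModel`, `maxInterleaveByMinCount` and `maxInterleaveByElementCount` are hypothetical names, the target maximum of 2 is the value mentioned earlier in this thread, and the hook is assumed to see only the known minimum element count.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for an element count that may be scalable.
struct ElementCountModel {
    unsigned minElements; // known minimum number of elements
    bool scalable;        // true for <vscale x N x T>
    bool isScalar() const { return !scalable && minElements == 1; }
};

// Target's maximum interleave factor, as mentioned in the thread.
const unsigned kTargetMaxInterleave = 2;

// Check that only sees the known minimum element count: <vscale x 1 x i64>
// is indistinguishable from a scalar, so interleaving is switched off.
inline unsigned maxInterleaveByMinCount(unsigned vf) {
    return vf == 1 ? 1 : kTargetMaxInterleave;
}

// Check that distinguishes a true scalar from a scalable vector with a
// minimum element count of 1.
inline unsigned maxInterleaveByElementCount(ElementCountModel vf) {
    return vf.isScalar() ? 1 : kTargetMaxInterleave;
}
```

In this model, i64 at LMUL=1 (`<vscale x 1 x i64>`, minimum count 1) gets an interleave factor of 1 from the first check, while LMUL=2 (`<vscale x 2 x i64>`) gets 2, which matches the doubled loads, adds and stores seen in the test diff.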

In D143723#4132349, @loralb wrote:

In D143723#4128065, @craig.topper wrote:

We could change getMaxInterleaveFactor() to return 1 instead of 2. We probably should do that since LMUL is like hardware interleaving.

I would still like to know why increasing LMUL also increased interleaving.

Hi everyone!
About @craig.topper's question, at BSC we already had the "pleasure" of stumbling into this: as far as my understanding goes, this happens because of the getMaxInterleaveFactor() function in llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h. Basically, a VF value of 1 is interpreted as "loop not vectorized" (since, for example, [1 x i64] is like a scalar i64), hence disabling interleaving. With a default LMUL value of 2 though, the VF == 1 check fails, meaning the actual MaxInterleavingFactor value is used.

P.S. As you may have noticed, the check in getMaxInterleaveFactor() completely ignores the existence of scalable vectors.

Thanks @loralb that makes perfect sense.

luke mentioned this in D144474: [LV][NFC] Use ElementCount for getMaxInterleaveFactor. Feb 21 2023, 5:15 AM

luke mentioned this in rGb02b1e0ed672: [LV][NFC] Use ElementCount for getMaxInterleaveFactor. Feb 22 2023, 2:15 AM

Rebase

Harbormaster completed remote builds in B215202: Diff 499435. Feb 22 2023, 4:12 AM

@loralb Just a heads up, I have a patch that will resolve the problem of disabling the vectorizer when interleave factor is 1. I haven't updated it in a while and hope I can land it before the discussion here has converged. https://reviews.llvm.org/D134745

In D143723#4180827, @eopXD wrote:

@loralb Just a heads up, I have a patch that will resolve the problem of disabling the vectorizer when interleave factor is 1. I haven't updated it in a while and hope I can land it before the discussion here has converged. https://reviews.llvm.org/D134745

@eopXD The issue discussed here was resolved by https://reviews.llvm.org/D144474 (which already landed). I think your patch addresses an unrelated issue.

In D143723#4181703, @reames wrote:

In D143723#4180827, @eopXD wrote:

@loralb Just a heads up, I have a patch that will resolve the problem of disabling the vectorizer when interleave factor is 1. I haven't updated it in a while and hope I can land it before the discussion here has converged. https://reviews.llvm.org/D134745

@eopXD The issue discussed here was resolved by https://reviews.llvm.org/D144474 (which already landed). I think your patch addresses an unrelated issue.

Yes, you are correct, I misread the comment above. Please disregard my heads-up.

craig.topper added inline comments. Mar 9 2023, 8:17 PM

llvm/test/Transforms/LoopVectorize/RISCV/lmul.ll
2	If we put this RUN line last, will it prevent the script from reordering the rest of the file?

Reshuffle run lines to make the diff cleaner

Herald added a subscriber: jobnoorman. Mar 13 2023, 4:09 AM

luke added inline comments. Mar 13 2023, 4:10 AM

llvm/test/Transforms/LoopVectorize/RISCV/lmul.ll
2	Good idea, it does indeed

luke marked 3 inline comments as done. Mar 13 2023, 4:10 AM

Harbormaster completed remote builds in B218988: Diff 504587. Mar 13 2023, 5:10 AM

LGTM

This revision is now accepted and ready to land. Mar 13 2023, 2:22 PM

Update tests

Herald added a subscriber: arphaman. Mar 13 2023, 5:29 PM

Undo unintentional steamrolling of handwritten tests with update_test_checks

Harbormaster completed remote builds in B219216: Diff 504904. Mar 13 2023, 7:19 PM

@craig.topper gentle ping that I've updated the tests and a lot of them have changed, not sure how I didn't include these beforehand

luke added inline comments. Mar 15 2023, 3:02 AM

llvm/test/Transforms/LoopVectorize/RISCV/short-trip-count.ll
75 ↗	(On Diff #504904)	Doesn't use masked load anymore
llvm/test/Transforms/LoopVectorize/RISCV/zvl32b.ll
24 ↗	(On Diff #504904)	Less interleaving here

LGTM

Update SLP test case

SLP calls this hook too to work out the max vec reg size.
With LMUL=2, the test case now has 128*2 = 256 bits to work with per "register", so it can now vectorize those 4 i64 stores.
Note that this doesn't actually change the generated code; both -riscv-v-register-bit-width-lmul=1 and -riscv-v-register-bit-width-lmul=2 produce the following:

foo:

vsetivli        zero, 4, e64, m2, ta, ma
vmv.v.i v8, 0
vse64.v v8, (a0)
ret
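
The 128*2 = 256 arithmetic from the note above can be written out as a small sketch. The function names are hypothetical and the model is deliberately crude: it treats the register width reported to the vectorizers as the minimum VLEN multiplied by LMUL, with 128 bits being the minimum VLEN assumed in the note.

```cpp
#include <cassert>

// Register width in this model: minimum VLEN times LMUL.
inline unsigned registerBits(unsigned minVlenBits, unsigned lmul) {
    return minVlenBits * lmul;
}

// Number of elements of the given width that fit in one register group.
inline unsigned elementsPerRegister(unsigned minVlenBits, unsigned lmul,
                                    unsigned elementBits) {
    assert(elementBits != 0 && "element width must be non-zero");
    return registerBits(minVlenBits, lmul) / elementBits;
}
```

Here `elementsPerRegister(128, 1, 64)` is 2 and `elementsPerRegister(128, 2, 64)` is 4, so under this model four i64 stores only fit in a single "register" once LMUL=2 is the default.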

Harbormaster completed remote builds in B220679: Diff 506921. Mar 21 2023, 6:00 AM

Rebase

Harbormaster completed remote builds in B221052: Diff 507401. Mar 22 2023, 10:48 AM

This revision was landed with ongoing or failed builds. Mar 23 2023, 3:33 AM

Closed by commit rG8d16c6809a08: [RISCV] Increase default vectorizer LMUL to 2 (authored by luke).

This revision was automatically updated to reflect the committed changes.

luke added a commit: rG8d16c6809a08: [RISCV] Increase default vectorizer LMUL to 2.

Diff 496404

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

[...]

 #define DEBUG_TYPE "riscvtti"

 static cl::opt<unsigned> RVVRegisterWidthLMUL(
     "riscv-v-register-bit-width-lmul",
     cl::desc(
         "The LMUL to use for getRegisterBitWidth queries. Affects LMUL used "
         "by autovectorized code. Fractional LMULs are not supported."),
-    cl::init(1), cl::Hidden);
+    cl::init(2), cl::Hidden);

 static cl::opt<unsigned> SLPMaxVF(
     "riscv-v-slp-max-vf",
     cl::desc(
         "Result used for getMaximumVF query which is used exclusively by "
         "SLP vectorizer. Defaults to 1 which disables SLP."),
     cl::init(1), cl::Hidden);

[...]

llvm/test/Transforms/LoopVectorize/RISCV/lmul.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt < %s -passes=loop-vectorize -mtriple riscv64 -mattr=+v -S | FileCheck %s -check-prefix=DEFAULT
 ; RUN: opt < %s -passes=loop-vectorize -mtriple riscv64 -mattr=+v -S --riscv-v-register-bit-width-lmul=1 | FileCheck %s -check-prefix=LMUL1
 ; RUN: opt < %s -passes=loop-vectorize -mtriple riscv64 -mattr=+v -S --riscv-v-register-bit-width-lmul=2 | FileCheck %s -check-prefix=LMUL2
 ; RUN: opt < %s -passes=loop-vectorize -mtriple riscv64 -mattr=+v -S --riscv-v-register-bit-width-lmul=4 | FileCheck %s -check-prefix=LMUL4
 ; RUN: opt < %s -passes=loop-vectorize -mtriple riscv64 -mattr=+v -S --riscv-v-register-bit-width-lmul=8 | FileCheck %s -check-prefix=LMUL8

 define void @load_store(ptr %p) {
 ; DEFAULT-LABEL: @load_store(
 ; DEFAULT-NEXT: entry:
 ; DEFAULT-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
-; DEFAULT-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP0]]
+; DEFAULT-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
+; DEFAULT-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP1]]
 ; DEFAULT-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; DEFAULT: vector.ph:
-; DEFAULT-NEXT: [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
-; DEFAULT-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP1]]
+; DEFAULT-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4
+; DEFAULT-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP3]]
 ; DEFAULT-NEXT: [[N_VEC:%.*]] = sub i64 1024, [[N_MOD_VF]]
 ; DEFAULT-NEXT: br label [[VECTOR_BODY:%.*]]
 ; DEFAULT: vector.body:
 ; DEFAULT-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; DEFAULT-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 0
-; DEFAULT-NEXT: [[TMP3:%.*]] = getelementptr inbounds i64, ptr [[P:%.*]], i64 [[TMP2]]
-; DEFAULT-NEXT: [[TMP4:%.*]] = getelementptr inbounds i64, ptr [[TMP3]], i32 0
-; DEFAULT-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 1 x i64>, ptr [[TMP4]], align 4
-; DEFAULT-NEXT: [[TMP5:%.*]] = add <vscale x 1 x i64> [[WIDE_LOAD]], shufflevector (<vscale x 1 x i64> insertelement (<vscale x 1 x i64> poison, i64 1, i64 0), <vscale x 1 x i64> poison, <vscale x 1 x i32> zeroinitializer)
-; DEFAULT-NEXT: store <vscale x 1 x i64> [[TMP5]], ptr [[TMP4]], align 4
-; DEFAULT-NEXT: [[TMP6:%.*]] = call i64 @llvm.vscale.i64()
-; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP6]]
-; DEFAULT-NEXT: [[TMP7:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
-; DEFAULT-NEXT: br i1 [[TMP7]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; DEFAULT-NEXT: [[TMP4:%.*]] = add i64 [[INDEX]], 0
+; DEFAULT-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 2
+; DEFAULT-NEXT: [[TMP7:%.*]] = add i64 [[TMP6]], 0
+; DEFAULT-NEXT: [[TMP8:%.*]] = mul i64 [[TMP7]], 1
+; DEFAULT-NEXT: [[TMP9:%.*]] = add i64 [[INDEX]], [[TMP8]]
+; DEFAULT-NEXT: [[TMP10:%.*]] = getelementptr inbounds i64, ptr [[P:%.*]], i64 [[TMP4]]
+; DEFAULT-NEXT: [[TMP11:%.*]] = getelementptr inbounds i64, ptr [[P]], i64 [[TMP9]]
+; DEFAULT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i64, ptr [[TMP10]], i32 0
+; DEFAULT-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 2 x i64>, ptr [[TMP12]], align 4
+; DEFAULT-NEXT: [[TMP13:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP14:%.*]] = mul i64 [[TMP13]], 2
+; DEFAULT-NEXT: [[TMP15:%.*]] = getelementptr inbounds i64, ptr [[TMP10]], i64 [[TMP14]]
+; DEFAULT-NEXT: [[WIDE_LOAD1:%.*]] = load <vscale x 2 x i64>, ptr [[TMP15]], align 4
+; DEFAULT-NEXT: [[TMP16:%.*]] = add <vscale x 2 x i64> [[WIDE_LOAD]], shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 1, i64 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer)
+; DEFAULT-NEXT: [[TMP17:%.*]] = add <vscale x 2 x i64> [[WIDE_LOAD1]], shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 1, i64 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer)
+; DEFAULT-NEXT: store <vscale x 2 x i64> [[TMP16]], ptr [[TMP12]], align 4
+; DEFAULT-NEXT: [[TMP18:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP19:%.*]] = mul i64 [[TMP18]], 2
+; DEFAULT-NEXT: [[TMP20:%.*]] = getelementptr inbounds i64, ptr [[TMP10]], i64 [[TMP19]]
+; DEFAULT-NEXT: store <vscale x 2 x i64> [[TMP17]], ptr [[TMP20]], align 4
+; DEFAULT-NEXT: [[TMP21:%.*]] = call i64 @llvm.vscale.i64()
+; DEFAULT-NEXT: [[TMP22:%.*]] = mul i64 [[TMP21]], 4
+; DEFAULT-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP22]]
+; DEFAULT-NEXT: [[TMP23:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
+; DEFAULT-NEXT: br i1 [[TMP23]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
 ; DEFAULT: middle.block:
 ; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 1024, [[N_VEC]]
 ; DEFAULT-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
 ; DEFAULT: scalar.ph:
 ; DEFAULT-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
 ; DEFAULT-NEXT: br label [[FOR_BODY:%.*]]
 ; DEFAULT: for.body:
 ; DEFAULT-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]

[...]

This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Increase default vectorizer LMUL to 2
ClosedPublic
