This is an archive of the discontinued LLVM Phabricator instance.

[LV] Account for minimum vscale when rejecting scalable vectorization of short loops
ClosedPublic

Authored by reames on Nov 2 2022, 1:17 PM.


Details

Reviewers

fhahn
david-arm

Commits

rGb0f904b6da04: [LV] Account for minimum vscale when rejecting scalable vectorization of short…

Summary

The vectorizer has code to reject scalable vectorization of loops with very short trip counts, and instead use fixed-length vectors. The current code doesn't account for the known minimum vscale value, and thus underestimates the number of lanes in the scalable type for RISCV's default configuration. This results in use of predication and a trivially dead loop where a single straight-line piece of code would suffice.

Note that the code quality of the original scalable vectorization could (and probably should) be improved other ways as well. This patch is solely about whether the scalable vectorization was the right choice to begin with.

This bit of code - both with and without my change - does make the unchecked assumption that the target knows how to lower fixed length vectors whose length is provably less than the vector length.
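
To make the summary concrete, the decision can be modelled outside LLVM. The following is a minimal sketch, not the code from the patch: `LaneCount`, `guaranteedLanes`, and `prefersFixedVF` are hypothetical names standing in for `ElementCount` and the patched condition, and the known-minimum lane count of 2 used in the discussion below is an assumption taken from the `<vscale x 2 x i32>` type that appears in the test.

```cpp
#include <optional>

// Hypothetical stand-in for llvm::ElementCount: a known-minimum lane count
// plus a flag saying whether it is multiplied by the runtime vscale.
struct LaneCount {
  unsigned KnownMin;
  bool Scalable;
};

// Number of lanes guaranteed to exist at run time. For a scalable count this
// is KnownMin times the minimum vscale, when a minimum is known (assumption:
// MinVScale models the lower bound of the function's vscale_range attribute).
unsigned guaranteedLanes(LaneCount EC, std::optional<unsigned> MinVScale) {
  unsigned Lanes = EC.KnownMin;
  if (EC.Scalable && MinVScale)
    Lanes *= *MinVScale;
  return Lanes;
}

bool isPowerOf2(unsigned X) { return X != 0 && (X & (X - 1)) == 0; }

// Same shape as the patched condition: fall back to a fixed VF only when the
// constant trip count fits in the lanes that are guaranteed to be present.
bool prefersFixedVF(unsigned ConstTripCount, LaneCount MaxEC,
                    std::optional<unsigned> MinVScale,
                    bool FoldTailByMasking) {
  return ConstTripCount != 0 &&
         ConstTripCount <= guaranteedLanes(MaxEC, MinVScale) &&
         (!FoldTailByMasking || isPowerOf2(ConstTripCount));
}
```

Under these assumptions, `prefersFixedVF(4, LaneCount{2, true}, 4, true)` is true (4 <= 2 * 4), while a minimum vscale of 1, or no known minimum at all, yields false and leaves the scalable, tail-folded plan in place. This matches the split between the two tests in short-trip-count.ll.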

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

reames created this revision. Nov 2 2022, 1:17 PM

Herald added a project: Restricted Project. Nov 2 2022, 1:17 PM

Herald added subscribers: frasercrmck, luismarques, apazos and 22 others.

reames requested review of this revision. Nov 2 2022, 1:17 PM

Herald added subscribers: pcwang-thead, alextsao1999, MaskRay.

Harbormaster completed remote builds in B195788: Diff 472736. Nov 2 2022, 3:02 PM

ping

david-arm added inline comments. Dec 6 2022, 6:49 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5213	nit: I think you can combine these two if statements into one, i.e.

    if (MaxVectorElementCount.isScalable() &&
        TheFunction->hasFnAttribute(Attribute::VScaleRange)) {
      auto Attr = TheFunction->getFnAttribute(Attribute::VScaleRange);
      auto Min = Attr.getVScaleRangeMin();
      WidestRegisterMinEC *= Min;
    }

llvm/test/Transforms/LoopVectorize/RISCV/short-trip-count.ll
3–4	I think this comment is no longer valid.
3–4	Perhaps it's better to have two tests now, one with min vscale = 4 and one with min vscale = 1, so that you can demonstrate the expected behaviour of using tail-folded loops for the latter as well?

Addressed two review comments, will reply to third in a moment.

Address third review comment.

In the process of writing my comment explaining why I didn't understand what was meant, I realized I did understand what was meant. :)

LGTM!

llvm/test/Transforms/LoopVectorize/RISCV/short-trip-count.ll
58	Ah ok I hadn't realised this is incompatible. I mainly suggested min=1 as a way to still show the tail-folding behaviour for low trip counts, but I guess I could also have suggested raising the known trip count to a higher value. Anyway, thank you!
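
The tail-folding behaviour discussed here can be illustrated with a small model. This is a sketch under assumptions: `activeLaneMask` and `tailFoldedIterations` are hypothetical helpers, the mask follows the rule that lane i is active when Index + i < TripCount (the semantics of `llvm.get.active.lane.mask`), and `RuntimeLanes` stands for vscale times the known-minimum lane count and must be nonzero.

```cpp
#include <vector>

// Lane I of the mask is active iff Index + I < TripCount.
std::vector<bool> activeLaneMask(unsigned Index, unsigned TripCount,
                                 unsigned RuntimeLanes) {
  std::vector<bool> Mask(RuntimeLanes);
  for (unsigned I = 0; I < RuntimeLanes; ++I)
    Mask[I] = Index + I < TripCount;
  return Mask;
}

// Number of trips through a tail-folded vector loop: the trip count divided
// by the runtime lane count, rounded up. RuntimeLanes must be nonzero.
unsigned tailFoldedIterations(unsigned TripCount, unsigned RuntimeLanes) {
  return (TripCount + RuntimeLanes - 1) / RuntimeLanes;
}
```

With a trip count of 4 and 8 runtime lanes, the loop body runs once with half of its lanes masked off, which is the "predication and a trivially dead loop" case from the summary. With only 2 runtime lanes it needs two fully active iterations, so a real loop remains. Either lowering the minimum vscale or raising the known trip count keeps a test in the second regime.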

This revision is now accepted and ready to land. Dec 7 2022, 8:58 AM

Harbormaster completed remote builds in B201713: Diff 480927. Dec 7 2022, 6:23 PM

Closed by commit rGb0f904b6da04: [LV] Account for minimum vscale when rejecting scalable vectorization of short… (authored by reames). Dec 9 2022, 11:30 AM

This revision was automatically updated to reflect the committed changes.

reames added a commit: rGb0f904b6da04: [LV] Account for minimum vscale when rejecting scalable vectorization of short….

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	13 lines
llvm/test/Transforms/LoopVectorize/RISCV/short-trip-count.ll	65 lines

Diff 481709

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

[... 5,176 lines not shown; nearest context: reportVectorizationFailure( ...]
       "NoTailLoopWithOptForSize", ORE, TheLoop);
   return FixedScalableVFPair::getNone();
 }

 ElementCount LoopVectorizationCostModel::getMaximizedVFForTarget(
     unsigned ConstTripCount, unsigned SmallestType, unsigned WidestType,
     ElementCount MaxSafeVF, bool FoldTailByMasking) {
   bool ComputeScalableMaxVF = MaxSafeVF.isScalable();
-  TypeSize WidestRegister = TTI.getRegisterBitWidth(
+  const TypeSize WidestRegister = TTI.getRegisterBitWidth(
       ComputeScalableMaxVF ? TargetTransformInfo::RGK_ScalableVector
                            : TargetTransformInfo::RGK_FixedWidthVector);

   // Convenience function to return the minimum of two ElementCounts.
   auto MinVF = [](const ElementCount &LHS, const ElementCount &RHS) {
     assert((LHS.isScalable() == RHS.isScalable()) &&
            "Scalable flags must match");
     return ElementCount::isKnownLT(LHS, RHS) ? LHS : RHS;
[... 10 lines not shown ...]

   if (!MaxVectorElementCount) {
     LLVM_DEBUG(dbgs() << "LV: The target has no "
                       << (ComputeScalableMaxVF ? "scalable" : "fixed")
                       << " vector registers.\n");
     return ElementCount::getFixed(1);
   }

-  const auto TripCountEC = ElementCount::getFixed(ConstTripCount);
-  if (ConstTripCount &&
-      ElementCount::isKnownLE(TripCountEC, MaxVectorElementCount) &&
+  unsigned WidestRegisterMinEC = MaxVectorElementCount.getKnownMinValue();
+  if (MaxVectorElementCount.isScalable() &&
+      TheFunction->hasFnAttribute(Attribute::VScaleRange)) {
+    auto Attr = TheFunction->getFnAttribute(Attribute::VScaleRange);
+    auto Min = Attr.getVScaleRangeMin();
+    WidestRegisterMinEC *= Min;
+  }
+  if (ConstTripCount && ConstTripCount <= WidestRegisterMinEC &&
       (!FoldTailByMasking || isPowerOf2_32(ConstTripCount))) {
     // If loop trip count (TC) is known at compile time there is no point in
     // choosing VF greater than TC (as done in the loop below). Select maximum
     // power of two which doesn't exceed TC.
     // If MaxVectorElementCount is scalable, we only fall back on a fixed VF
     // when the TC is less than or equal to the known number of lanes.
     auto ClampedConstTripCount = PowerOf2Floor(ConstTripCount);
     LLVM_DEBUG(dbgs() << "LV: Clamping the MaxVF to maximum power of two not "
[... 5,459 lines not shown ...]

llvm/test/Transforms/LoopVectorize/RISCV/short-trip-count.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -S -mtriple=riscv64 -mattr=+v -passes=loop-vectorize < %s | FileCheck %s

-; FIXME: Using a <4 x i32> would be strictly better than tail folded
-; scalable vectorization in this case.
-define void @small_trip_count(i32* nocapture %a) nounwind vscale_range(4,1024) {
-; CHECK-LABEL: @small_trip_count(
+define void @small_trip_count_min_vlen_128(i32* nocapture %a) nounwind vscale_range(4,1024) {
+; CHECK-LABEL: @small_trip_count_min_vlen_128(
+; CHECK-NEXT: entry:
+; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK: vector.ph:
+; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
+; CHECK: vector.body:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[INDEX]], 0
+; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i32 [[TMP0]]
+; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i32 0
+; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[TMP2]] to <4 x i32>*
+; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, <4 x i32>* [[TMP3]], align 4
+; CHECK-NEXT: [[TMP4:%.*]] = add nsw <4 x i32> [[WIDE_LOAD]], <i32 1, i32 1, i32 1, i32 1>
+; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32* [[TMP2]] to <4 x i32>*
+; CHECK-NEXT: store <4 x i32> [[TMP4]], <4 x i32>* [[TMP5]], align 4
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
+; CHECK-NEXT: br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK: middle.block:
+; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 4, 4
+; CHECK-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
+; CHECK: scalar.ph:
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ 4, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
+; CHECK-NEXT: br label [[LOOP:%.*]]
+; CHECK: loop:
+; CHECK-NEXT: [[IV:%.*]] = phi i32 [ [[IV_NEXT:%.*]], [[LOOP]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
+; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds i32, i32* [[A]], i32 [[IV]]
+; CHECK-NEXT: [[V:%.*]] = load i32, i32* [[GEP]], align 4
+; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[V]], 1
+; CHECK-NEXT: store i32 [[ADD]], i32* [[GEP]], align 4
+; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
+; CHECK-NEXT: [[COND:%.*]] = icmp eq i32 [[IV]], 3
+; CHECK-NEXT: br i1 [[COND]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP2:![0-9]+]]
+; CHECK: exit:
+; CHECK-NEXT: ret void
+;
+entry:
+  br label %loop
+
+loop:
+  %iv = phi i32 [ %iv.next, %loop ], [ 0, %entry ]
+  %gep = getelementptr inbounds i32, i32* %a, i32 %iv
+  %v = load i32, i32* %gep, align 4
+  %add = add nsw i32 %v, 1
+  store i32 %add, i32* %gep, align 4
+  %iv.next = add i32 %iv, 1
+  %cond = icmp eq i32 %iv, 3
+  br i1 %cond, label %exit, label %loop
+
+exit:
+  ret void
+}
+
+; Note: This test uses a vscale_range starting at 1, which is technically incompatibile
+; with +v. If we expose a target hook for minimum vlen, this example will need reworked
+define void @small_trip_count_min_vlen_32(i32* nocapture %a) nounwind vscale_range(1,1024) {
+; CHECK-LABEL: @small_trip_count_min_vlen_32(
 ; CHECK-NEXT: entry:
 ; CHECK-NEXT: [[TMP0:%.*]] = call i32 @llvm.vscale.i32()
 ; CHECK-NEXT: [[TMP1:%.*]] = mul i32 [[TMP0]], 2
 ; CHECK-NEXT: [[TMP2:%.*]] = icmp ult i32 -5, [[TMP1]]
 ; CHECK-NEXT: br i1 [[TMP2]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
 ; CHECK: vector.ph:
 ; CHECK-NEXT: [[TMP3:%.*]] = call i32 @llvm.vscale.i32()
 ; CHECK-NEXT: [[TMP4:%.*]] = mul i32 [[TMP3]], 2
[... 14 lines not shown ...]
 ; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 2 x i32> @llvm.masked.load.nxv2i32.p0nxv2i32(<vscale x 2 x i32>* [[TMP11]], i32 4, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], <vscale x 2 x i32> poison)
 ; CHECK-NEXT: [[TMP12:%.*]] = add nsw <vscale x 2 x i32> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 2 x i32> insertelement (<vscale x 2 x i32> poison, i32 1, i32 0), <vscale x 2 x i32> poison, <vscale x 2 x i32> zeroinitializer)
 ; CHECK-NEXT: [[TMP13:%.*]] = bitcast i32* [[TMP10]] to <vscale x 2 x i32>*
 ; CHECK-NEXT: call void @llvm.masked.store.nxv2i32.p0nxv2i32(<vscale x 2 x i32> [[TMP12]], <vscale x 2 x i32>* [[TMP13]], i32 4, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
 ; CHECK-NEXT: [[TMP14:%.*]] = call i32 @llvm.vscale.i32()
 ; CHECK-NEXT: [[TMP15:%.*]] = mul i32 [[TMP14]], 2
 ; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], [[TMP15]]
 ; CHECK-NEXT: [[TMP16:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
-; CHECK-NEXT: br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
+; CHECK-NEXT: br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
 ; CHECK: middle.block:
 ; CHECK-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
 ; CHECK: scalar.ph:
 ; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
 ; CHECK-NEXT: br label [[LOOP:%.*]]
 ; CHECK: loop:
 ; CHECK-NEXT: [[IV:%.*]] = phi i32 [ [[IV_NEXT:%.*]], [[LOOP]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
 ; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds i32, i32* [[A]], i32 [[IV]]
 ; CHECK-NEXT: [[V:%.*]] = load i32, i32* [[GEP]], align 4
 ; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[V]], 1
 ; CHECK-NEXT: store i32 [[ADD]], i32* [[GEP]], align 4
 ; CHECK-NEXT: [[IV_NEXT]] = add i32 [[IV]], 1
 ; CHECK-NEXT: [[COND:%.*]] = icmp eq i32 [[IV]], 3
-; CHECK-NEXT: br i1 [[COND]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP2:![0-9]+]]
+; CHECK-NEXT: br i1 [[COND]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP5:![0-9]+]]
 ; CHECK: exit:
 ; CHECK-NEXT: ret void
 ;
 entry:
   br label %loop

 loop:
   %iv = phi i32 [ %iv.next, %loop ], [ 0, %entry ]
[... 11 lines not shown ...]