This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize] Permit tail-folding for low trip counts using scalable vectors
Closed · Public

Authored by david-arm on Mar 14 2022, 4:44 AM.

Details

Reviewers

sdesmalen
kmclaughlin
frasercrmck
peterwaller-arm

Commits

rGbefc95204506: [LoopVectorize] Permit tail-folding for low trip counts using scalable vectors

Summary

When the loop vectoriser encounters a known low trip count it tries
to create a single predicated loop in order to get the benefit of
vectorisation and eliminate the scalar tail. However, until now the
vectoriser prevented the use of scalable vectors in this case due
to concerns in the past about stability. I believe that tail-folded
loops using scalable vectors are now sufficiently well tested that
we can enable this. For the same reason I've also enabled it when
optimising for code size too.

Tests added here:

Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll
Transforms/LoopVectorize/AArch64/sve-tail-folding-optsize.ll
Transforms/LoopVectorize/RISCV/low-trip-count.ll
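
As a rough illustration of the tail-folding described in the summary, the sketch below models a single predicated loop in scalar C++ for the `dst[i] += src[i] << 1` kernel used by the new tests. It is not vectoriser code: `activeLaneMask` and `tailFoldedLoop` are hypothetical helper names, `vf` stands in for the runtime lane count (`vscale` times the minimum element count), and the shift is written as a multiply by 2 to stay well defined for negative values.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a per-iteration lane mask: lane L is active
// when (base + L) < tripCount.
std::vector<bool> activeLaneMask(std::size_t base, std::size_t tripCount,
                                 std::size_t vf) {
  std::vector<bool> mask(vf);
  for (std::size_t lane = 0; lane < vf; ++lane)
    mask[lane] = base + lane < tripCount;
  return mask;
}

// Scalar model of a tail-folded loop computing dst[i] += src[i] * 2.
// Every "vector" iteration handles vf lanes; inactive lanes are skipped,
// which plays the role of the masked loads and the masked store.
// Returns the number of vector iterations executed.
std::size_t tailFoldedLoop(std::int64_t *dst, const std::int64_t *src,
                           std::size_t tripCount, std::size_t vf) {
  std::size_t iterations = 0;
  for (std::size_t index = 0; index < tripCount; index += vf) {
    const std::vector<bool> mask = activeLaneMask(index, tripCount, vf);
    for (std::size_t lane = 0; lane < vf; ++lane)
      if (mask[lane])
        dst[index + lane] += src[index + lane] * 2;
    ++iterations;
  }
  return iterations;
}
```

With a trip count of 7 and `vf` of 4, the model runs two predicated iterations and the last lane of the second iteration is masked off, so no scalar remainder loop is needed.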

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

david-arm created this revision. · Mar 14 2022, 4:44 AM

Herald added a project: Restricted Project. · Mar 14 2022, 4:44 AM

Herald added subscribers: luke957, ctetreau, luismarques and 21 others.

david-arm requested review of this revision. · Mar 14 2022, 4:44 AM

Herald added subscribers: llvm-commits, alextsao1999, pcwang-thead, MaskRay.

Harbormaster completed remote builds in B154083: Diff 415063. · Mar 14 2022, 5:36 AM

Added more CHECK lines to the new tests because they will be useful for a future patch.

Herald added a subscriber: arichardson. · Mar 17 2022, 3:36 AM

david-arm added a child revision: D121899: [LoopVectorize] Optimise away the icmp when tail-folding for some low trip counts. · Mar 17 2022, 3:38 AM

sdesmalen added inline comments. · Mar 17 2022, 3:59 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5135–5136	I think we can remove the condition entirely, so that we consider using scalable vectors + tail folding when optimising for code-size as well. I know that for SVE we'll want to improve code quality to avoid the redundant compare, but when optimising for code-size the user has made the decision that code-size is more important than performance. And the cost-model will still have a say in which is more beneficial (scalar, fixed or scalable) and may still choose a fixed-width VF in case the ScalableVF may not be legal for the loop.

Harbormaster completed remote builds in B154804: Diff 416123. · Mar 17 2022, 4:40 AM

Matt added a subscriber: Matt. · Mar 17 2022, 5:48 PM

Completely removed all restrictions on using tail-folding for scalable vectors.
Added test to show we apply tail-folding when compiling with -Os
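
To make the effect of removing the restriction concrete, here is a minimal stand-alone model of the gating logic before and after this revision. It is a sketch only: `MaxFactorsModel`, `beforePatch` and `afterPatch` are hypothetical names, `ScalableVF` is reduced to a plain minimum lane count where 0 means "no scalable VF", and the enumerators other than the two named in the diff are recalled from `LoopVectorize.cpp` and should be treated as assumptions.

```cpp
// Simplified stand-in for the vectoriser's scalar epilogue status enum.
// Only the two *UsePredicate names appear in the diff; the rest are assumed.
enum ScalarEpilogueLowering {
  CM_ScalarEpilogueAllowed,
  CM_ScalarEpilogueNotAllowedOptSize,
  CM_ScalarEpilogueNotAllowedLowTripLoop,
  CM_ScalarEpilogueNotNeededUsePredicate,
  CM_ScalarEpilogueNotAllowedUsePredicate
};

// Hypothetical, flattened model of the candidate VFs (not the LLVM type).
struct MaxFactorsModel {
  unsigned FixedVF;    // maximum fixed-width VF
  unsigned ScalableVF; // minimum lanes of the scalable VF, 0 = disabled
};

// Before: scalable VFs were discarded unless predication was explicitly
// requested, so low trip counts and optsize fell back to fixed-width VFs.
MaxFactorsModel beforePatch(ScalarEpilogueLowering Status,
                            MaxFactorsModel MaxFactors) {
  if (Status != CM_ScalarEpilogueNotNeededUsePredicate &&
      Status != CM_ScalarEpilogueNotAllowedUsePredicate &&
      MaxFactors.ScalableVF != 0)
    MaxFactors.ScalableVF = 0;
  return MaxFactors;
}

// After: the check is gone, so the scalable VF stays a candidate and the
// cost model decides between scalar, fixed-width and scalable.
MaxFactorsModel afterPatch(ScalarEpilogueLowering /*Status*/,
                           MaxFactorsModel MaxFactors) {
  return MaxFactors;
}
```

In this model a low-trip-count loop keeps its scalable candidate only after the change; whether it is actually chosen is still left to the cost model, as noted in the inline comment.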

david-arm marked an inline comment as done. · Apr 27 2022, 8:20 AM

Harbormaster completed remote builds in B161611: Diff 425526. · Apr 27 2022, 9:08 AM

Gentle ping.

sdesmalen accepted this revision. · May 12 2022, 6:37 AM

This revision is now accepted and ready to land. · May 12 2022, 6:37 AM

This revision was landed with ongoing or failed builds. · May 16 2022, 1:14 AM

Closed by commit rGbefc95204506: [LoopVectorize] Permit tail-folding for low trip counts using scalable vectors (authored by david-arm).

This revision was automatically updated to reflect the committed changes.

david-arm added a commit: rGbefc95204506: [LoopVectorize] Permit tail-folding for low trip counts using scalable vectors.

pcwang-thead mentioned this in D125747: [RISCV] Enable scalable vectorization by default for RVV. · May 17 2022, 11:39 PM

dtemirbulatov mentioned this in D130755: [LoopVectorize] Introduce trip count minimal value threshold to ignore tail-folding for scalable vectors. · Jul 29 2022, 3:38 AM

dtemirbulatov mentioned this in rGcab6cd683402: [AArch64][LoopVectorize] Introduce trip count minimal value threshold to ignore…. · Aug 9 2022, 2:11 PM

Revision Contents

Path · Size

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp · 8 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll · 73 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-optsize.ll · 39 lines
llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll · 33 lines

Diff 429639

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

     const SCEV *Rem = SE->getURemExpr(
         SE->applyLoopGuards(ExitCount, TheLoop),
         SE->getConstant(BackedgeTakenCount->getType(), MaxVFtimesIC));
     if (Rem->isZero()) {
       // Accept MaxFixedVF if we do not have a tail.
       LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");
       return MaxFactors;
     }
   }
 
-  // For scalable vectors don't use tail folding for low trip counts or
-  // optimizing for code size. We only permit this if the user has explicitly
-  // requested it.
-  if (ScalarEpilogueStatus != CM_ScalarEpilogueNotNeededUsePredicate &&
-      ScalarEpilogueStatus != CM_ScalarEpilogueNotAllowedUsePredicate &&
-      MaxFactors.ScalableVF.isVector())
-    MaxFactors.ScalableVF = ElementCount::getScalable(0);
-
   // If we don't know the precise trip count, or if the trip count that we
   // found modulo the vectorization factor is not zero, try to fold the tail
   // by masking.
   // FIXME: look for a smaller MaxVF that does divide TC rather than masking.
   if (Legal->prepareToFoldTailByMasking()) {
     FoldTailByMasking = true;
     return MaxFactors;
   }

llvm/test/Transforms/LoopVectorize/AArch64/sve-low-trip-count.ll

This file was added.

				; RUN: opt -loop-vectorize -S < %s | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"

				define void @trip7_i64(i64* noalias nocapture noundef %dst, i64* noalias nocapture noundef readonly %src) #0 {
				; CHECK-LABEL: @trip7_i64(
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT:%.*]], %vector.body ]
				; CHECK: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 {{%.*}}, i64 7)
				; CHECK: {{%.*}} = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0nxv2i64(<vscale x 2 x i64>* {{%.*}}, i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], <vscale x 2 x i64> poison)
				; CHECK: {{%.*}} = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0nxv2i64(<vscale x 2 x i64>* {{%.*}}, i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], <vscale x 2 x i64> poison)
				; CHECK: call void @llvm.masked.store.nxv2i64.p0nxv2i64(<vscale x 2 x i64> {{%.*}}, <vscale x 2 x i64>* {{%.*}}, i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
				; CHECK: [[VSCALE:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[VF:%.*]] = mul i64 [[VSCALE]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[VF]]
				; CHECK-NEXT: [[COND:%.*]] = icmp eq i64 [[INDEX_NEXT]], {{%.*}}
				; CHECK-NEXT: br i1 [[COND]], label %middle.block, label %vector.body
				;
				entry:
				br label %for.body

				for.body: ; preds = %entry, %for.body
				%i.06 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
				%arrayidx = getelementptr inbounds i64, i64* %src, i64 %i.06
				%0 = load i64, i64* %arrayidx, align 8
				%mul = shl nsw i64 %0, 1
				%arrayidx1 = getelementptr inbounds i64, i64* %dst, i64 %i.06
				%1 = load i64, i64* %arrayidx1, align 8
				%add = add nsw i64 %1, %mul
				store i64 %add, i64* %arrayidx1, align 8
				%inc = add nuw nsw i64 %i.06, 1
				%exitcond.not = icmp eq i64 %inc, 7
				br i1 %exitcond.not, label %for.end, label %for.body

				for.end: ; preds = %for.body
				ret void
				}

				define void @trip5_i8(i8* noalias nocapture noundef %dst, i8* noalias nocapture noundef readonly %src) #0 {
				; CHECK-LABEL: @trip5_i8(
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT:%.*]], %vector.body ]
				; CHECK: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 16 x i1> @llvm.get.active.lane.mask.nxv16i1.i64(i64 {{%.*}}, i64 5)
				; CHECK: {{%.*}} = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0nxv16i8(<vscale x 16 x i8>* {{%.*}}, i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
				; CHECK: {{%.*}} = call <vscale x 16 x i8> @llvm.masked.load.nxv16i8.p0nxv16i8(<vscale x 16 x i8>* {{%.*}}, i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]], <vscale x 16 x i8> poison)
				; CHECK: call void @llvm.masked.store.nxv16i8.p0nxv16i8(<vscale x 16 x i8> {{%.*}}, <vscale x 16 x i8>* {{%.*}}, i32 1, <vscale x 16 x i1> [[ACTIVE_LANE_MASK]])
				; CHECK: [[VSCALE:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[VF:%.*]] = mul i64 [[VSCALE]], 16
				; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[VF]]
				; CHECK-NEXT: [[COND:%.*]] = icmp eq i64 [[INDEX_NEXT]], {{%.*}}
				; CHECK-NEXT: br i1 [[COND]], label %middle.block, label %vector.body
				;
				entry:
				br label %for.body

				for.body: ; preds = %entry, %for.body
				%i.08 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
				%arrayidx = getelementptr inbounds i8, i8* %src, i64 %i.08
				%0 = load i8, i8* %arrayidx, align 1
				%mul = shl i8 %0, 1
				%arrayidx1 = getelementptr inbounds i8, i8* %dst, i64 %i.08
				%1 = load i8, i8* %arrayidx1, align 1
				%add = add i8 %mul, %1
				store i8 %add, i8* %arrayidx1, align 1
				%inc = add nuw nsw i64 %i.08, 1
				%exitcond.not = icmp eq i64 %inc, 5
				br i1 %exitcond.not, label %for.end, label %for.body

				for.end: ; preds = %for.body
				ret void
				}

				attributes #0 = { vscale_range(1,16) "target-features"="+sve" }

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-optsize.ll

This file was added.

				; RUN: opt -loop-vectorize -S < %s | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"

				define void @trip1024_i64(i64* noalias nocapture noundef %dst, i64* noalias nocapture noundef readonly %src) #0 {
				; CHECK-LABEL: @trip1024_i64(
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT:%.*]], %vector.body ]
				; CHECK: [[ACTIVE_LANE_MASK:%.*]] = call <vscale x 2 x i1> @llvm.get.active.lane.mask.nxv2i1.i64(i64 {{%.*}}, i64 1024)
				; CHECK: {{%.*}} = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0nxv2i64(<vscale x 2 x i64>* {{%.*}}, i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], <vscale x 2 x i64> poison)
				; CHECK: {{%.*}} = call <vscale x 2 x i64> @llvm.masked.load.nxv2i64.p0nxv2i64(<vscale x 2 x i64>* {{%.*}}, i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]], <vscale x 2 x i64> poison)
				; CHECK: call void @llvm.masked.store.nxv2i64.p0nxv2i64(<vscale x 2 x i64> {{%.*}}, <vscale x 2 x i64>* {{%.*}}, i32 8, <vscale x 2 x i1> [[ACTIVE_LANE_MASK]])
				; CHECK: [[VSCALE:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[VF:%.*]] = mul i64 [[VSCALE]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[VF]]
				; CHECK-NEXT: [[COND:%.*]] = icmp eq i64 [[INDEX_NEXT]], {{%.*}}
				; CHECK-NEXT: br i1 [[COND]], label %middle.block, label %vector.body
				;
				entry:
				br label %for.body

				for.body: ; preds = %entry, %for.body
				%i.06 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
				%arrayidx = getelementptr inbounds i64, i64* %src, i64 %i.06
				%0 = load i64, i64* %arrayidx, align 8
				%mul = shl nsw i64 %0, 1
				%arrayidx1 = getelementptr inbounds i64, i64* %dst, i64 %i.06
				%1 = load i64, i64* %arrayidx1, align 8
				%add = add nsw i64 %1, %mul
				store i64 %add, i64* %arrayidx1, align 8
				%inc = add nuw nsw i64 %i.06, 1
				%exitcond.not = icmp eq i64 %inc, 1024
				br i1 %exitcond.not, label %for.end, label %for.body

				for.end: ; preds = %for.body
				ret void
				}

				attributes #0 = { vscale_range(1,16) "target-features"="+sve" optsize }

llvm/test/Transforms/LoopVectorize/RISCV/low-trip-count.ll

This file was added.

				; RUN: opt -loop-vectorize -riscv-v-vector-bits-min=128 -scalable-vectorization=on -force-target-instruction-cost=1 -S < %s | FileCheck %s

				target triple = "riscv64"

				define void @trip5_i8(i8* noalias nocapture noundef %dst, i8* noalias nocapture noundef readonly %src) #0 {
				; CHECK-LABEL: @trip5_i8(
				; CHECK: vector.body:
				; CHECK: [[ACTIVE_LANE_MASK:%.*]] = icmp ule <vscale x 8 x i64> {{%.*}}, shufflevector (<vscale x 8 x i64> insertelement (<vscale x 8 x i64> poison, i64 4, i32 0), <vscale x 8 x i64> poison, <vscale x 8 x i32> zeroinitializer)
				; CHECK: {{%.*}} = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0nxv8i8(<vscale x 8 x i8>* {{%.*}}, i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]], <vscale x 8 x i8> poison)
				; CHECK: {{%.*}} = call <vscale x 8 x i8> @llvm.masked.load.nxv8i8.p0nxv8i8(<vscale x 8 x i8>* {{%.*}}, i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]], <vscale x 8 x i8> poison)
				; CHECK: call void @llvm.masked.store.nxv8i8.p0nxv8i8(<vscale x 8 x i8> {{%.*}}, <vscale x 8 x i8>* {{%.*}}, i32 1, <vscale x 8 x i1> [[ACTIVE_LANE_MASK]])
				;
				entry:
				br label %for.body

				for.body: ; preds = %entry, %for.body
				%i.08 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
				%arrayidx = getelementptr inbounds i8, i8* %src, i64 %i.08
				%0 = load i8, i8* %arrayidx, align 1
				%mul = shl i8 %0, 1
				%arrayidx1 = getelementptr inbounds i8, i8* %dst, i64 %i.08
				%1 = load i8, i8* %arrayidx1, align 1
				%add = add i8 %mul, %1
				store i8 %add, i8* %arrayidx1, align 1
				%inc = add nuw nsw i64 %i.08, 1
				%exitcond.not = icmp eq i64 %inc, 5
				br i1 %exitcond.not, label %for.end, label %for.body

				for.end: ; preds = %for.body
				ret void
				}

				attributes #0 = { "target-features"="+v,+d" }
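
The AArch64 tests check for a mask built with `llvm.get.active.lane.mask` against the trip count, whereas the RISCV test checks for an `icmp ule` against a splat of 4, which is the trip count of 5 minus one. The sketch below, assuming no overflow in the lane index and using hypothetical helper names, shows that the two formulations select the same lanes.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Mask in the style of the AArch64 CHECK lines:
// lane is active when (base + lane) < tripCount.
std::vector<bool> maskFromTripCount(std::uint64_t base,
                                    std::uint64_t tripCount,
                                    std::size_t lanes) {
  std::vector<bool> mask(lanes);
  for (std::size_t lane = 0; lane < lanes; ++lane)
    mask[lane] = base + lane < tripCount;
  return mask;
}

// Mask in the style of the RISCV CHECK line:
// lane is active when (base + lane) <= tripCount - 1 (unsigned compare).
std::vector<bool> maskFromLastIndex(std::uint64_t base,
                                    std::uint64_t lastIndex,
                                    std::size_t lanes) {
  std::vector<bool> mask(lanes);
  for (std::size_t lane = 0; lane < lanes; ++lane)
    mask[lane] = base + lane <= lastIndex;
  return mask;
}
```

For a trip count of 5 and 8 lanes, both helpers mark lanes 0 to 4 as active and lanes 5 to 7 as inactive.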