This is an archive of the discontinued LLVM Phabricator instance.

Differential D66108

[LV] fold-tail flag
Closed · Public

Authored by dorit on Aug 12 2019, 1:11 PM.

Details

Reviewers

Ayal
hsaito
fhahn
SjoerdMeijer

Commits

rG491ca2425d4a: [LV] Fold-tail flag
rL368801: [LV] Fold-tail flag

Summary

This is the compiler-flag equivalent of the Predicate pragma (https://reviews.llvm.org/D65197): it directs the vectorizer to fold the remainder loop into the main loop using predication.
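
To make the two strategies concrete, here is a minimal scalar sketch; it is not the vectorizer's actual output. It emulates an 8-lane vector loop over A[i] = B[i] + C[i], once with a scalar remainder loop and once with the tail folded into the main loop under a per-lane mask. The function names and the width constant VF are assumptions made up for illustration, and the real vectorizer may compute its mask differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vector width, chosen to match the <8 x i32> types in the test.
constexpr std::size_t VF = 8;

// Strategy 1: "vector" main loop followed by a scalar epilogue (remainder
// loop). Returns the number of main-loop iterations executed.
std::size_t addWithEpilogue(std::vector<int> &A, const std::vector<int> &B,
                            const std::vector<int> &C) {
  assert(A.size() == B.size() && B.size() == C.size());
  const std::size_t N = A.size();
  const std::size_t VecEnd = N - N % VF; // largest multiple of VF <= N
  std::size_t Iters = 0;
  for (std::size_t I = 0; I < VecEnd; I += VF, ++Iters)
    for (std::size_t Lane = 0; Lane < VF; ++Lane) // all lanes active
      A[I + Lane] = B[I + Lane] + C[I + Lane];
  for (std::size_t I = VecEnd; I < N; ++I) // scalar remainder loop
    A[I] = B[I] + C[I];
  return Iters;
}

// Strategy 2: tail folded into the main loop. The trip count is rounded up
// to a multiple of VF and every lane is guarded by a mask, which stands in
// for masked loads and stores. Returns the number of iterations executed.
std::size_t addWithFoldedTail(std::vector<int> &A, const std::vector<int> &B,
                              const std::vector<int> &C) {
  assert(A.size() == B.size() && B.size() == C.size());
  const std::size_t N = A.size();
  const std::size_t RoundedN = (N + VF - 1) / VF * VF;
  std::size_t Iters = 0;
  for (std::size_t I = 0; I != RoundedN; I += VF, ++Iters)
    for (std::size_t Lane = 0; Lane < VF; ++Lane) {
      const bool Mask = I + Lane < N; // lane is inactive past the end
      if (Mask)
        A[I + Lane] = B[I + Lane] + C[I + Lane];
    }
  return Iters;
}
```

For an assumed trip count of 430 (picked only because it rounds up to the 432 bound that appears in the test's CHECK lines), the first version runs 53 full iterations plus 6 scalar ones, while the second runs 54 masked iterations and needs no remainder loop.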

Diff Detail

Repository: rL LLVM

Event Timeline

dorit created this revision. Aug 12 2019, 1:11 PM

Herald added a subscriber: rkruppe. Aug 12 2019, 1:11 PM

SjoerdMeijer added inline comments. Aug 12 2019, 11:34 PM

test/Transforms/LoopVectorize/X86/tail_loop_folding.ll
Line 2 (on Diff #214694): Because these test cases have the `vectorize.predicate.enable` metadata set, I am not sure we are actually testing this new option. I think we need a separate function without the predicate metadata.
Line 17 (on Diff #214694): I expect the output to be the same whether the pragma or this new option is used, so can we just use the CHECK tag?

Thanks for taking a look! Please see responses below.

test/Transforms/LoopVectorize/X86/tail_loop_folding.ll
Line 2 (on Diff #214694): "Because these test cases have the vectorize.predicate.enable metadata set": not really. The second function has the vectorize.predicate.enable metadata disabled. That is why, in the second function, the check for no masked loads/stores passes without the new flag (see the CHECK part), while with the new flag the check expects to find masked loads/stores (see the PREDFLAG part).
Line 17 (on Diff #214694): I'm not sure I understand what you are suggesting. I'm trying to distinguish between two runs: one without the flag and one with it. In the first function the output is indeed the same for both runs, but this is not the case in the second function: if the second run (with the flag) checked all the CHECK tags, it would fail in the second function, where the output differs.

Ah sorry, ignore me! I messed that up.

This looks like a good and straightforward change to me.

This revision is now accepted and ready to land. Aug 13 2019, 2:59 AM

Closed by commit rL368801: [LV] Fold-tail flag (authored by dorit). Aug 13 2019, 10:21 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                                Size
llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp               18 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/tail_loop_folding.ll   20 lines
Diff 215026

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp

[171 lines not shown]
 /// Loops with a known constant trip count below this number are vectorized only
 /// if no scalar iteration overheads are incurred.
 static cl::opt<unsigned> TinyTripCountVectorThreshold(
     "vectorizer-min-trip-count", cl::init(16), cl::Hidden,
     cl::desc("Loops with a constant trip count that is smaller than this "
              "value are vectorized only if no scalar iteration overheads "
              "are incurred."));

+// Indicates that an epilogue is undesired, predication is preferred.
+// This means that the vectorizer will try to fold the loop-tail (epilogue)
+// into the loop and predicate the loop body accordingly.
+static cl::opt<bool> PreferPredicateOverEpilog(
+    "prefer-predicate-over-epilog", cl::init(false), cl::Hidden,
+    cl::desc("Indicate that an epilogue is undesired, predication should be "
+             "used instead."));
+
 static cl::opt<bool> MaximizeBandwidth(
     "vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
     cl::desc("Maximize bandwidth when selecting vectorization factor which "
              "will be determined by the smallest type in loop."));

 static cl::opt<bool> EnableInterleavedMemAccesses(
     "enable-interleaved-mem-accesses", cl::init(false), cl::Hidden,
     cl::desc("Enable vectorization on interleaved memory accesses in a loop"));
[713 lines not shown; context: enum ScalarEpilogueLowering {]

   // A special case of vectorisation with OptForSize: loops with a very small
   // trip count are considered for vectorization under OptForSize, thereby
   // making sure the cost of their loop body is dominant, free of runtime
   // guards and scalar iteration overheads.
   CM_ScalarEpilogueNotAllowedLowTripLoop,

   // Loop hint predicate indicating an epilogue is undesired.
-  CM_ScalarEpilogueNotNeededPredicatePragma
+  CM_ScalarEpilogueNotNeededUsePredicate
 };

 /// LoopVectorizationCostModel - estimates the expected speedups due to
 /// vectorization.
 /// In many cases vectorization is not profitable. This can happen because of
 /// a number of reasons. In this class we mainly attempt to predict the
 /// expected speedup/slowdowns due to the supported instruction set. We use the
 /// TargetTransformInfo to query the different backends for the cost of
[3,881 lines not shown; context: reportVectorizationFailure("Single iteration (non) loop",]
         "loop trip count is one, irrelevant for vectorization",
         "SingleIterationLoop", ORE, TheLoop);
     return None;
   }

   switch (ScalarEpilogueStatus) {
   case CM_ScalarEpilogueAllowed:
     return computeFeasibleMaxVF(TC);
-  case CM_ScalarEpilogueNotNeededPredicatePragma:
+  case CM_ScalarEpilogueNotNeededUsePredicate:
     LLVM_DEBUG(
-        dbgs() << "LV: vector predicate hint found.\n"
+        dbgs() << "LV: vector predicate hint/switch found.\n"
                << "LV: Not allowing scalar epilogue, creating predicated "
                << "vector loop.\n");
     break;
   case CM_ScalarEpilogueNotAllowedLowTripLoop:
     // fallthrough as a special case of OptForSize
   case CM_ScalarEpilogueNotAllowedOptSize:
     if (ScalarEpilogueStatus == CM_ScalarEpilogueNotAllowedOptSize)
       LLVM_DEBUG(
[2,475 lines not shown]
 static ScalarEpilogueLowering
 getScalarEpilogueLowering(Function *F, Loop *L, LoopVectorizeHints &Hints,
                           ProfileSummaryInfo *PSI, BlockFrequencyInfo *BFI) {
   ScalarEpilogueLowering SEL = CM_ScalarEpilogueAllowed;
   if (Hints.getForce() != LoopVectorizeHints::FK_Enabled &&
       (F->hasOptSize() ||
        llvm::shouldOptimizeForSize(L->getHeader(), PSI, BFI)))
     SEL = CM_ScalarEpilogueNotAllowedOptSize;
-  else if (Hints.getPredicate())
-    SEL = CM_ScalarEpilogueNotNeededPredicatePragma;
+  else if (PreferPredicateOverEpilog || Hints.getPredicate())
+    SEL = CM_ScalarEpilogueNotNeededUsePredicate;

   return SEL;
 }

 // Process the loop in the VPlan-native vectorization path. This path builds
 // VPlan upfront in the vectorization pipeline, which allows to apply
 // VPlan-to-VPlan transformations from the very beginning without modifying the
 // input LLVM IR.
[476 lines not shown]
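
The priority order in getScalarEpilogueLowering above can be sketched as a small standalone function. This is only an illustration of the decision logic, under the assumption that the LLVM queries are replaced by plain booleans; the LoweringInputs struct, its field names, and chooseLowering are hypothetical and not part of LLVM. The CM_ScalarEpilogueNotAllowedLowTripLoop value is left out because this function never selects it.

```cpp
#include <cassert>

// Subset of the enum from the diff (post-change names).
enum ScalarEpilogueLowering {
  CM_ScalarEpilogueAllowed,
  CM_ScalarEpilogueNotAllowedOptSize,
  CM_ScalarEpilogueNotNeededUsePredicate
};

// Hypothetical stand-ins for the queries made in the real function.
struct LoweringInputs {
  bool ForceVectorize;      // Hints.getForce() == LoopVectorizeHints::FK_Enabled
  bool OptForSize;          // F->hasOptSize() || shouldOptimizeForSize(...)
  bool PredicateHint;       // Hints.getPredicate()
  bool PreferPredicateFlag; // -prefer-predicate-over-epilog
};

ScalarEpilogueLowering chooseLowering(const LoweringInputs &In) {
  // Optimizing for size takes priority unless vectorization is forced.
  if (!In.ForceVectorize && In.OptForSize)
    return CM_ScalarEpilogueNotAllowedOptSize;
  // After this change, either the flag or the loop hint requests predication.
  if (In.PreferPredicateFlag || In.PredicateHint)
    return CM_ScalarEpilogueNotNeededUsePredicate;
  return CM_ScalarEpilogueAllowed;
}

// Mirrors the test expectations as described in the review thread, assuming
// neither test function is forced or optimized for size.
void checkTestMatrix() {
  // @tail_folding_enabled: the hint is set, so both runs predicate.
  assert(chooseLowering({false, false, true, false}) ==
         CM_ScalarEpilogueNotNeededUsePredicate);
  assert(chooseLowering({false, false, true, true}) ==
         CM_ScalarEpilogueNotNeededUsePredicate);
  // @tail_folding_disabled: only the run with the flag predicates.
  assert(chooseLowering({false, false, false, false}) ==
         CM_ScalarEpilogueAllowed);
  assert(chooseLowering({false, false, false, true}) ==
         CM_ScalarEpilogueNotNeededUsePredicate);
}
```

The last two assertions correspond to the point made in the review: the second test function is where the CHECK and PREDFLAG runs are expected to differ.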

llvm/trunk/test/Transforms/LoopVectorize/X86/tail_loop_folding.ll

 ; RUN: opt < %s -loop-vectorize -S | FileCheck %s
+; RUN: opt < %s -loop-vectorize -prefer-predicate-over-epilog -S | FileCheck -check-prefix=PREDFLAG %s

 target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-unknown-linux-gnu"

 define dso_local void @tail_folding_enabled(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) local_unnamed_addr #0 {
 ; CHECK-LABEL: tail_folding_enabled(
 ; CHECK: vector.body:
 ; CHECK: %wide.masked.load = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(
 ; CHECK: %wide.masked.load1 = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(
 ; CHECK: %8 = add nsw <8 x i32> %wide.masked.load1, %wide.masked.load
 ; CHECK: call void @llvm.masked.store.v8i32.p0v8i32(
 ; CHECK: %index.next = add i64 %index, 8
 ; CHECK: %12 = icmp eq i64 %index.next, 432
 ; CHECK: br i1 %12, label %middle.block, label %vector.body, !llvm.loop !0
+; PREDFLAG-LABEL: tail_folding_enabled(
+; PREDFLAG: vector.body:
+; PREDFLAG: %wide.masked.load = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(
+; PREDFLAG: %wide.masked.load1 = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(
+; PREDFLAG: %8 = add nsw <8 x i32> %wide.masked.load1, %wide.masked.load
+; PREDFLAG: call void @llvm.masked.store.v8i32.p0v8i32(
+; PREDFLAG: %index.next = add i64 %index, 8
+; PREDFLAG: %12 = icmp eq i64 %index.next, 432
+; PREDFLAG: br i1 %12, label %middle.block, label %vector.body, !llvm.loop !0
 entry:
   br label %for.body

 for.cond.cleanup:
   ret void

 for.body:
   %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
[10 lines not shown]
 }

 define dso_local void @tail_folding_disabled(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) local_unnamed_addr #0 {
 ; CHECK-LABEL: tail_folding_disabled(
 ; CHECK: vector.body:
 ; CHECK-NOT: @llvm.masked.load.v8i32.p0v8i32(
 ; CHECK-NOT: @llvm.masked.store.v8i32.p0v8i32(
 ; CHECK: br i1 %44, label {{.*}}, label %vector.body
+; PREDFLAG-LABEL: tail_folding_disabled(
+; PREDFLAG: vector.body:
+; PREDFLAG: %wide.masked.load = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(
+; PREDFLAG: %wide.masked.load1 = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(
+; PREDFLAG: %8 = add nsw <8 x i32> %wide.masked.load1, %wide.masked.load
+; PREDFLAG: call void @llvm.masked.store.v8i32.p0v8i32(
+; PREDFLAG: %index.next = add i64 %index, 8
+; PREDFLAG: %12 = icmp eq i64 %index.next, 432
+; PREDFLAG: br i1 %12, label %middle.block, label %vector.body, !llvm.loop !4
 entry:
   br label %for.body

 for.cond.cleanup:
   ret void

 for.body:
   %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
[28 lines not shown]