This is an archive of the discontinued LLVM Phabricator instance.

[LV] Optimize for size when vectorizing loops with tiny trip count
Closed, Public

Authored by Ayal on Jun 19 2017, 4:49 PM.

Details

Reviewers

mkuper
twoh
hfinkel

Commits

rG8d26f0a602f8: [LV] Optimize for size when vectorizing loops with tiny trip count
rL306803: [LV] Optimize for size when vectorizing loops with tiny trip count

Summary

Try to vectorize loops whose trip count is smaller than TinyTripCountVectorThreshold under the OptForSize constraint, rather than not trying to vectorize them at all. The OptForSize constraint implies little if any overhead outside of the vectorized loop body, so the current cost estimate comparing the vectorized loop body against the scalar one should hopefully be sufficiently accurate.

This also holds when the small trip count comes from profile data rather than static analysis, for potential cases where the trip count is statically known to be divisible by the VF.
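The behavioral change described in the summary can be sketched as below. This is a simplified, self-contained model and not the code in LoopVectorize.cpp: the `LoopFacts` struct, the `Decision` enum, and both function names are hypothetical. Only the default threshold of 16 and the rule that a forced loop ignores size constraints are taken from the patch.

```cpp
#include <cassert>

// Hypothetical summary of what the vectorizer knows about a loop.
struct LoopFacts {
  unsigned ExpectedTC; // expected trip count, 0 if unknown
  bool FnOptForSize;   // enclosing function is optimized for size
  bool Forced;         // vectorization explicitly forced by a loop hint
};

// Outcome of the trip count check only; legality and cost checks follow later.
enum class Decision { Reject, Consider, ConsiderOptForSize };

// Before the patch: a tiny trip count rejects the loop unless it is forced.
Decision decideBefore(const LoopFacts &L, unsigned Threshold = 16) {
  bool Tiny = L.ExpectedTC > 0 && L.ExpectedTC < Threshold;
  if (Tiny && !L.Forced)
    return Decision::Reject;
  bool OptForSize = !L.Forced && L.FnOptForSize;
  return OptForSize ? Decision::ConsiderOptForSize : Decision::Consider;
}

// After the patch: a tiny trip count switches on the OptForSize rules instead.
Decision decideAfter(const LoopFacts &L, unsigned Threshold = 16) {
  bool OptForSize = !L.Forced && L.FnOptForSize;
  bool Tiny = L.ExpectedTC > 0 && L.ExpectedTC < Threshold;
  if (Tiny && !L.Forced)
    OptForSize = true;
  return OptForSize ? Decision::ConsiderOptForSize : Decision::Consider;
}
```

In this model a loop with an expected trip count of 8 is rejected by `decideBefore` but reaches the later checks under size constraints with `decideAfter`.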

Patch inspired by D32451.

Diff Detail

Repository: rL LLVM

Event Timeline

Ayal created this revision. Jun 19 2017, 4:49 PM

Herald added a subscriber: mzolotukhin. Jun 19 2017, 4:49 PM

hfinkel added a subscriber: hfinkel. Jun 19 2017, 5:25 PM

hfinkel added inline comments.

lib/Transforms/Vectorize/LoopVectorize.cpp
Line 7829 (on Diff #103123): Please add some comments here explaining this behavior (it seems to make sense after reading the patch description, but without that context, I'd likely find this confusing).

I think this is the right approach, but I am concerned that the experimental results I shared on D32451 show that it is generally better not to vectorize low trip count loops. @Ayal, I wonder if you have any results showing that this patch actually improves performance. Thanks!

hfinkel, in reply to @twoh (D34373#784975):

I know that we're currently missing opportunities for large vectorizable loops with low (static) trip counts. Smaller inner loops are also good candidates for unpredicated vectorization, but we may need to be a bit careful because of modeling inaccuracies and phase-ordering effects (e.g. if we don't vectorize a loop, then we'll end up unrolling it when the unroller runs).

twoh, in reply to @hfinkel (D34373#785367):

Got it. My concern was for small single-level loops with low trip counts, as I observe them pretty frequently. I have no objection to accepting this patch and improving the cost estimator separately.

hfinkel, in reply to @twoh (D34373#785485):

Do you mean that you see such loops frequently with dynamically-small trip counts, or with static trip counts? I assume the small loops with (small) static trip counts will generally be unrolled.

twoh, in reply to @hfinkel (D34373#785642):

Actually you're right. The case I observed was a small static trip count loop that was completely unrolled and SLP vectorized, which actually harmed performance, but not with LV. I think this patch should work if it effectively targets loops that are large enough not to be unrolled.

hfinkel mentioned this in D32729: LV: Don't vectorize with unknown loop counts on divergent targets. Jun 20 2017, 1:51 PM

In reply to @twoh (D34373#785678):

I agree. Given that this will only vectorize loops that don't need a remainder loop, even if it would otherwise be unrolled, that should be fine. As you might be pointing out with your observation about SLP vectorization sometimes hurting performance, there certainly are cases where vectorization of small numbers of instructions can harm performance on OOO cores (for example, because it introduces additional data dependencies that might be more harmful than the corresponding increase in parallelism). It seems possible that the code for a small loop that comes out of the LV might have the same issue (if, for example, we generate unaligned vector loads, or access strided data and then shuffle it together). However, it is not clear to me that this will be the case. The SLP vectorizer has a minimum tree height of three, and for the LV to produce a loop that unrolls to less than three instructions, I assume it would need to essentially be a memcpy. I suspect that we'll need to try it and see if we find regressions.

Yes, we saw a couple of ~7% improvements running EEMBC benchmarks on x86.

This patch applies mostly to short static trip counts. For it to apply to short profile-based trip counts, they would need to be divisible by VF statically to avoid a remainder loop.

The current cost-model aims to estimate the relative performance of the loop body (only), when vectorized vs. original scalar version. The overheads of runtime guards and remainder loop may certainly outweigh the gains of the vectorized body, especially if the trip count is small; unless we know the former are not needed at all. If the body is expected to run faster when vectorized with a large trip count, it seems reasonable to expect it would do so with a small trip count, when all that's running is the body. Right?
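The argument above can be made concrete with a small cost comparison. The sketch below is illustrative only: the cost units, the `LoopCosts` fields, and the function names are assumptions for this example, not LLVM's cost model. It shows that when the trip count is a multiple of VF and no runtime guards are needed, the comparison reduces to the per-body costs.

```cpp
#include <cassert>

// Hypothetical per-loop costs in arbitrary units.
struct LoopCosts {
  unsigned ScalarBody; // cost of one scalar iteration
  unsigned VectorBody; // cost of one vector iteration covering VF elements
  unsigned GuardCost;  // one-time cost of runtime checks, 0 if none are needed
};

// Total cost of running the original scalar loop for TC iterations.
unsigned scalarLoopCost(const LoopCosts &C, unsigned TC) {
  return TC * C.ScalarBody;
}

// Total cost of the vectorized loop: guards, vector iterations, and the
// scalar remainder iterations left over when TC is not a multiple of VF.
unsigned vectorLoopCost(const LoopCosts &C, unsigned TC, unsigned VF) {
  assert(VF > 0 && "vectorization factor must be positive");
  unsigned VectorIters = TC / VF;
  unsigned RemainderIters = TC % VF;
  return C.GuardCost + VectorIters * C.VectorBody +
         RemainderIters * C.ScalarBody;
}
```

With the made-up costs ScalarBody = 4, VectorBody = 6, and GuardCost = 24, a loop with TC = 8 and VF = 4 costs 32 units scalar and 36 units vectorized, so the guards erase the gain. With GuardCost = 0 the vectorized cost drops to 12 units, which is the case the patch targets.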

Regarding unrolling such small trip-count loops, note that the loop-vectorizer itself may decide to do so, with interleaving.

Sure, will add comments explaining the logic behind turning on OptForSize in this case.

dorit added a subscriber: dorit. Jun 25 2017, 5:50 AM

Hello @Ayal, can you please update the comments per @hfinkel's request so that I can accept the patch? Thanks!

Updated version includes the comment requested by @hfinkel.

@mkuper, this is somewhat related to D26873; it conceptually allows vectorizing loops with low dynamic trip-counts that are also known to be divisible by VF statically; but current computeMaxVF() allows vectorizing loops under OptForSize only if their trip count is known statically, and known to be divisible by VF.
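The restriction mentioned here can be illustrated with a small check. This is a sketch under stated assumptions, not the body of computeMaxVF(): the function name `canVectorizeWithoutTail` and the use of `std::optional` (C++17) to model an unknown trip count are hypothetical.

```cpp
#include <cassert>
#include <optional>

// Returns true if vectorizing by VF needs no scalar remainder loop: the trip
// count must be known statically, be nonzero, and be a multiple of VF.
bool canVectorizeWithoutTail(std::optional<unsigned> StaticTC, unsigned VF) {
  assert(VF > 0 && "vectorization factor must be positive");
  if (!StaticTC || *StaticTC == 0)
    return false; // unknown or empty trip count: a remainder cannot be ruled out
  return *StaticTC % VF == 0;
}
```

Under this model, the trip count of 8 in small-loop.ll passes with a vector width of 4, while a trip count of 10 or a trip count known only from profile data does not.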

LGTM

This revision is now accepted and ready to land. Jun 28 2017, 10:56 AM

Closed by commit rL306803: [LV] Optimize for size when vectorizing loops with tiny trip count (authored by ayalz). Jun 30 2017, 1:02 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp	59 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/vect.omp.force.small-tc.ll	31 lines
llvm/trunk/test/Transforms/LoopVectorize/small-loop.ll	6 lines

Diff 104829

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp

[108 lines not shown]

 STATISTIC(LoopsVectorized, "Number of loops vectorized");
 STATISTIC(LoopsAnalyzed, "Number of loops analyzed for vectorization");

 static cl::opt<bool>
     EnableIfConversion("enable-if-conversion", cl::init(true), cl::Hidden,
                        cl::desc("Enable if-conversion during vectorization."));

-/// We don't vectorize loops with a known constant trip count below this number.
+/// Loops with a known constant trip count below this number are vectorized only
+/// if no scalar iteration overheads are incurred.
 static cl::opt<unsigned> TinyTripCountVectorThreshold(
     "vectorizer-min-trip-count", cl::init(16), cl::Hidden,
-    cl::desc("Don't vectorize loops with a constant "
-             "trip count that is smaller than this "
-             "value."));
+    cl::desc("Loops with a constant trip count that is smaller than this "
+             "value are vectorized only if no scalar iteration overheads "
+             "are incurred."));

 static cl::opt<bool> MaximizeBandwidth(
     "vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
     cl::desc("Maximize bandwidth when selecting vectorization factor which "
              "will be determined by the smallest type in loop."));

 static cl::opt<bool> EnableInterleavedMemAccesses(
     "enable-interleaved-mem-accesses", cl::init(false), cl::Hidden,
[7,665 lines not shown]
   // less verbose reporting vectorized loops and unvectorized loops that may
   // benefit from vectorization, respectively.

   if (!Hints.allowVectorization(F, L, AlwaysVectorize)) {
     DEBUG(dbgs() << "LV: Loop hints prevent vectorization.\n");
     return false;
   }

-  // Check the loop for a trip count threshold:
-  // do not vectorize loops with a tiny trip count.
+  PredicatedScalarEvolution PSE(SE, L);
+
+  // Check if it is legal to vectorize the loop.
+  LoopVectorizationRequirements Requirements(*ORE);
+  LoopVectorizationLegality LVL(L, PSE, DT, TLI, AA, F, TTI, GetLAA, LI, ORE,
+                                &Requirements, &Hints);
+  if (!LVL.canVectorize()) {
+    DEBUG(dbgs() << "LV: Not vectorizing: Cannot prove legality.\n");
+    emitMissedWarning(F, L, Hints, ORE);
+    return false;
+  }
+
+  // Check the function attributes to find out if this function should be
+  // optimized for size.
+  bool OptForSize =
+      Hints.getForce() != LoopVectorizeHints::FK_Enabled && F->optForSize();
+
+  // Check the loop for a trip count threshold: vectorize loops with a tiny trip
+  // count by optimizing for size, to minimize overheads.
   unsigned ExpectedTC = SE->getSmallConstantMaxTripCount(L);
   bool HasExpectedTC = (ExpectedTC > 0);

   if (!HasExpectedTC && LoopVectorizeWithBlockFrequency) {
     auto EstimatedTC = getLoopEstimatedTripCount(L);
     if (EstimatedTC) {
       ExpectedTC = *EstimatedTC;
       HasExpectedTC = true;
     }
   }

   if (HasExpectedTC && ExpectedTC < TinyTripCountVectorThreshold) {
     DEBUG(dbgs() << "LV: Found a loop with a very small trip count. "
-                 << "This loop is not worth vectorizing.");
+                 << "This loop is worth vectorizing only if no scalar "
+                 << "iteration overheads are incurred.");
     if (Hints.getForce() == LoopVectorizeHints::FK_Enabled)
       DEBUG(dbgs() << " But vectorizing was explicitly forced.\n");
     else {
       DEBUG(dbgs() << "\n");
-      ORE->emit(createMissedAnalysis(Hints.vectorizeAnalysisPassName(),
-                                     "NotBeneficial", L)
-                << "vectorization is not beneficial "
-                   "and is not explicitly forced");
-      return false;
+      // Loops with a very small trip count are considered for vectorization
+      // under OptForSize, thereby making sure the cost of their loop body is
+      // dominant, free of runtime guards and scalar iteration overheads.
+      OptForSize = true;
     }
   }

-  PredicatedScalarEvolution PSE(SE, L);
-
-  // Check if it is legal to vectorize the loop.
-  LoopVectorizationRequirements Requirements(*ORE);
-  LoopVectorizationLegality LVL(L, PSE, DT, TLI, AA, F, TTI, GetLAA, LI, ORE,
-                                &Requirements, &Hints);
-  if (!LVL.canVectorize()) {
-    DEBUG(dbgs() << "LV: Not vectorizing: Cannot prove legality.\n");
-    emitMissedWarning(F, L, Hints, ORE);
-    return false;
-  }
-
-  // Check the function attributes to find out if this function should be
-  // optimized for size.
-  bool OptForSize =
-      Hints.getForce() != LoopVectorizeHints::FK_Enabled && F->optForSize();
-
   // Check the function attributes to see if implicit floats are allowed.
   // FIXME: This check doesn't seem possibly correct -- what if the loop is
   // an integer loop and the vector instructions selected are purely integer
   // vector instructions?
   if (F->hasFnAttribute(Attribute::NoImplicitFloat)) {
     DEBUG(dbgs() << "LV: Can't vectorize when the NoImplicitFloat"
                     "attribute is used.\n");
     ORE->emit(createMissedAnalysis(Hints.vectorizeAnalysisPassName(),
[250 lines not shown]

llvm/trunk/test/Transforms/LoopVectorize/X86/vect.omp.force.small-tc.ll

 ; RUN: opt < %s -loop-vectorize -mtriple=x86_64-apple-macosx10.8.0 -mcpu=corei7-avx -debug-only=loop-vectorize -stats -S -vectorizer-min-trip-count=21 2>&1 | FileCheck %s
 ; REQUIRES: asserts

 ; CHECK: LV: Loop hints: force=enabled
 ; CHECK: LV: Loop hints: force=?
+; CHECK: LV: Loop hints: force=?
 ; No more loops in the module
 ; CHECK-NOT: LV: Loop hints: force=
-; CHECK: 2 loop-vectorize - Number of loops analyzed for vectorization
-; CHECK: 1 loop-vectorize - Number of loops vectorized
+; CHECK: 3 loop-vectorize - Number of loops analyzed for vectorization
+; CHECK: 2 loop-vectorize - Number of loops vectorized

 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
 target triple = "x86_64-apple-macosx10.8.0"

 ;
 ; The source code for the test:
 ;
 ; void foo(float* restrict A, float* restrict B)
[48 lines not shown]
   br i1 %exitcond, label %for.end, label %for.body, !llvm.loop !3

 for.end:
   ret void
 }

 !3 = !{!3}
+
+;
+; This loop will be vectorized as the trip count is below the threshold but no
+; scalar iterations are needed.
+;
+define void @vectorized2(float* noalias nocapture %A, float* noalias nocapture readonly %B) {
+entry:
+  br label %for.body
+
+for.body:
+  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
+  %arrayidx = getelementptr inbounds float, float* %B, i64 %indvars.iv
+  %0 = load float, float* %arrayidx, align 4, !llvm.mem.parallel_loop_access !3
+  %arrayidx2 = getelementptr inbounds float, float* %A, i64 %indvars.iv
+  %1 = load float, float* %arrayidx2, align 4, !llvm.mem.parallel_loop_access !3
+  %add = fadd fast float %0, %1
+  store float %add, float* %arrayidx2, align 4, !llvm.mem.parallel_loop_access !3
+  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
+  %exitcond = icmp eq i64 %indvars.iv.next, 16
+  br i1 %exitcond, label %for.end, label %for.body, !llvm.loop !4
+
+for.end:
+  ret void
+}
+
+!4 = !{!4}
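For readers less familiar with LLVM IR, the loop in @vectorized2 corresponds roughly to the source below. This is a reconstruction from the IR and not text from the patch; the function name is reused only for illustration, and the fast-math flag and the parallel-loop metadata of the IR are not modeled.

```cpp
#include <cassert>
#include <cstddef>

// Adds B[i] into A[i] for a fixed trip count of 16. With the RUN line's
// -vectorizer-min-trip-count=21, 16 counts as a tiny trip count.
void vectorized2(float *A, const float *B) {
  for (std::size_t i = 0; i < 16; ++i)
    A[i] = B[i] + A[i];
}
```

A trip count of 16 is a multiple of the power-of-two vector widths up to 16, so no scalar remainder iterations are needed for those widths.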

llvm/trunk/test/Transforms/LoopVectorize/small-loop.ll

 ; RUN: opt < %s -loop-vectorize -force-vector-interleave=1 -force-vector-width=4 -dce -instcombine -S | FileCheck %s

 target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"

 @a = common global [2048 x i32] zeroinitializer, align 16
 @b = common global [2048 x i32] zeroinitializer, align 16
 @c = common global [2048 x i32] zeroinitializer, align 16

 ;CHECK-LABEL: @example1(
-;CHECK-NOT: load <4 x i32>
+;CHECK: load <4 x i32>
 ;CHECK: ret void
 define void @example1() nounwind uwtable ssp {
   br label %1

 ; <label>:1 ; preds = %1, %0
   %indvars.iv = phi i64 [ 0, %0 ], [ %indvars.iv.next, %1 ]
   %2 = getelementptr inbounds [2048 x i32], [2048 x i32]* @b, i64 0, i64 %indvars.iv
   %3 = load i32, i32* %2, align 4
   %4 = getelementptr inbounds [2048 x i32], [2048 x i32]* @c, i64 0, i64 %indvars.iv
   %5 = load i32, i32* %4, align 4
   %6 = add nsw i32 %5, %3
   %7 = getelementptr inbounds [2048 x i32], [2048 x i32]* @a, i64 0, i64 %indvars.iv
   store i32 %6, i32* %7, align 4
   %indvars.iv.next = add i64 %indvars.iv, 1
   %lftr.wideiv = trunc i64 %indvars.iv.next to i32
-  %exitcond = icmp eq i32 %lftr.wideiv, 8 ; <----- A really small trip count.
-  br i1 %exitcond, label %8, label %1
+  %exitcond = icmp eq i32 %lftr.wideiv, 8 ; <----- A really small trip count
+  br i1 %exitcond, label %8, label %1 ; w/o scalar iteration overhead.

 ; <label>:8 ; preds = %1
   ret void
 }

 ;CHECK-LABEL: @bound1(
 ;CHECK-NOT: load <4 x i32>
 ;CHECK: ret void
[22 lines not shown]