
[LV] Don't vectorize when we have a static bound on trip count
ClosedPublic

Authored by mkuper on Dec 12 2016, 3:39 PM.

Details

Summary

We check if the exact trip count is known and is smaller than the "tiny loop" bound. We should be checking the maximum bound on the trip count instead.

Note that right now this probably won't do a lot of good, since getSmallConstantMaxTripCount() seems to choke on the interesting cases; I hope to fix this in follow-ups.
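The change described above can be sketched as a standalone model. This is not LLVM's actual code: the threshold constant and the `ExactTC == 0` / `MaxTC == 0` encoding of "unknown" are assumptions here, mirroring how getSmallConstantTripCount() and getSmallConstantMaxTripCount() return 0 when no constant is known.

```cpp
#include <cassert>

// Hypothetical stand-in for the vectorizer's "tiny loop" threshold.
constexpr unsigned TinyTripCountVectorThreshold = 16;

// Before: bail out only when the *exact* trip count is known and tiny.
// ExactTC == 0 means "unknown", as with getSmallConstantTripCount().
bool bailOutBefore(unsigned ExactTC) {
  return ExactTC > 0 && ExactTC < TinyTripCountVectorThreshold;
}

// After: also bail out when only an *upper bound* on the trip count is
// known (getSmallConstantMaxTripCount-style) and that bound is tiny.
bool bailOutAfter(unsigned MaxTC) {
  return MaxTC > 0 && MaxTC < TinyTripCountVectorThreshold;
}
```

A loop whose trip count is not exactly known but is provably at most 8, for example, passes `bailOutBefore` (exact count unknown) yet is caught by `bailOutAfter`.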

Diff Detail

Repository
rL LLVM

Event Timeline

mkuper updated this revision to Diff 81155.Dec 12 2016, 3:39 PM
mkuper retitled this revision from to [LV] Don't vectorize when we have a static bound on trip count.
mkuper updated this object.
mkuper added reviewers: mssimpso, gilr.
mkuper added a subscriber: llvm-commits.
mssimpso accepted this revision.Dec 12 2016, 4:35 PM
mssimpso edited edge metadata.

LGTM.

This revision is now accepted and ready to land.Dec 12 2016, 4:35 PM

Can you please remind me why we're doing this? A loop with a small trip count should be unrolled if its body is small, so we should really only be seeing loops here with small trip counts but large bodies. That being the case, if vectorization looks profitable, why isn't it?

Can you please remind me why we're doing this? A loop with a small trip count should be unrolled if its body is small, so we should really only be seeing loops here with small trip counts but large bodies. That being the case, if vectorization looks profitable, why isn't it?

There are two separate reasons it may not be profitable:

  1. The one-time setup overhead for the vectorized loop - broadcasts, bounds computations for runtime checks, etc. - is too high w.r.t. the gain from vectorization.
  2. We should never use a VF that is higher than the trip-count, since if we do, we're only paying the price for the overhead, while in practice we'll only execute the scalar remainder loop. This is, of course, completely independent of the size of the loop body. (That may change if we start using masked vector remainder loops, but I think Intel experimented with that, and didn't get very good results, so I don't see this happening in the near future.)

For (1), what we should do, if we want to be precise, is compute the overhead cost, and then take both the cost and the expected number of iterations into account as part of the cost model. As to (2), the right way to deal with it would be to upper-bound the VF choice by the trip-count.
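Point (2) can be made concrete with a little arithmetic. The following is a simplified model (assumed here, not taken from the vectorizer): with trip count TC and vector factor VF, the vector body runs TC / VF iterations and the scalar remainder runs TC % VF iterations.

```cpp
#include <cassert>

// Simplified cost-free model of a vectorized loop's iteration split.
// With VF > TC the vector body runs zero times and every iteration
// falls into the scalar remainder - pure overhead, no vector work.
unsigned vectorIterations(unsigned TC, unsigned VF) { return TC / VF; }
unsigned scalarRemainder(unsigned TC, unsigned VF) { return TC % VF; }
```

For TC = 3 and VF = 4, the vector body executes 0 times and the scalar remainder executes all 3 iterations, so the setup overhead is paid for nothing.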

However, since we don't really model the overhead correctly, what we have (and have had since forever) is this big hammer of just bailing out on loops with a small trip count. What I'm doing is extending this to uniformly cover all kinds of loops - that is, not only loops with a known exact static trip count, but also dynamic trip counts (for PGO builds) and static trip count bounds.

My main motivations for that are:

a) Loops that have an unknown static trip-count, but according to the profile have very low iteration counts (0 or 1). These should not be unrolled, and should never be vectorized either.

b) Remainder loops for hand-vectorized code. These will also not be unrolled - the trip-count is unknown, and doesn't have a known multiple. (We may end up with runtime unrolling and yet another "remainder loop", which doesn't really improve things.) And, of course, it's almost always a bad idea to vectorize these. (The exception may be something like hand-vectorization by 16, with a scalar remainder loop. We may want to vectorize that remainder by 4 and leave a smaller scalar remainder, but that sounds like a very small win.)
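The "very small win" in (b) can be quantified with a sketch (the shapes here are illustrative assumptions, not measured data): a loop hand-vectorized by 16 leaves a scalar remainder of N % 16 iterations, and re-vectorizing that remainder by 4 only converts at most 15 scalar iterations into at most 3 vector iterations plus at most 3 scalar ones.

```cpp
#include <cassert>

// Hypothetical illustration of re-vectorizing a hand-vectorized loop's
// remainder. After a by-16 vector loop, N % 16 scalar iterations remain;
// vectorizing those by 4 splits them into vector and scalar parts.
struct RemainderShape {
  unsigned VecIters;    // by-4 vector iterations in the remainder
  unsigned ScalarIters; // scalar iterations left after that
};

RemainderShape revectorizeRemainder(unsigned N) {
  unsigned Rem = N % 16;     // left over from the hand-vectorized loop
  return {Rem / 4, Rem % 4}; // re-vectorize the remainder by 4
}
```

Even in the worst case (N % 16 == 15), the re-vectorized remainder saves only a dozen scalar iterations while adding another loop with its own setup cost.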

I agree that we probably want to fix (1) eventually, but that will still need a good trip count estimate, so it's in some sense orthogonal to improving the estimate in cases where it's not statically known exactly.

Can you please remind me why we're doing this? A loop with a small trip count should be unrolled if its body is small, so we should really only be seeing loops here with small trip counts but large bodies. That being the case, if vectorization looks profitable, why isn't it?

There are two separate reasons it may not be profitable:

  1. The one-time setup overhead for the vectorized loop - broadcasts, bounds computations for runtime checks, etc. - is too high w.r.t. the gain from vectorization.
  2. We should never use a VF that is higher than the trip-count, since if we do, we're only paying the price for the overhead, while in practice we'll only execute the scalar remainder loop. This is, of course, completely independent of the size of the loop body. (That may change if we start using masked vector remainder loops, but I think Intel experimented with that, and didn't get very good results, so I don't see this happening in the near future.)

    For (1), what we should do, if we want to be precise, is compute the overhead cost, and then take both the cost and the expected number of iterations into account as part of the cost model. As to (2), the right way to deal with it would be to upper-bound the VF choice by the trip-count.

    However, since we don't really model the overhead correctly, what we have (and have had since forever) is this big hammer of just bailing out on loops with a small trip count. What I'm doing is extending this to uniformly cover all kinds of loops - that is, not only loops with a known exact static trip count, but also dynamic trip counts (for PGO builds) and static trip count bounds.

    My main motivations for that are:

    a) Loops that have an unknown static trip-count, but according to the profile have very low iteration counts (0 or 1). These should not be unrolled, and should never be vectorized either.

For 0 or 1, agreed.

b) Remainder loops for hand-vectorized code. These will also not be unrolled - the trip-count is unknown, and doesn't have a known multiple. (We may end up with runtime unrolling and yet another "remainder loop", which doesn't really improve things.) And, of course, it's almost always a bad idea to vectorize these. (The exception may be something like hand-vectorization by 16, with a scalar remainder loop. We may want to vectorize that remainder by 4 and leave a smaller scalar remainder, but that sounds like a very small win.)

I agree, but I think we're going about this the wrong way. The cost of the branching and runtime checks needs to be factored into the cost model (which will be relevant for low-trip-count loops), and that should naturally prevent this kind of messiness. Just not vectorizing low-trip-count loops is suboptimal because it will miss cases where vectorization is quite profitable.

I agree that we probably want to fix (1) eventually, but that will still need a good trip count estimate, so it's in some sense orthogonal to improving the estimate in cases where it's not statically known exactly.

This revision was automatically updated to reflect the committed changes.