Diff 133029

lib/Transforms/Vectorize/LoopVectorize.cpp


...
cl::desc("A flag that overrides the target's expected cost for "
"useful for getting consistent testing."));		"useful for getting consistent testing."));

static cl::opt<unsigned> SmallLoopCost(		static cl::opt<unsigned> SmallLoopCost(
"small-loop-cost", cl::init(20), cl::Hidden,		"small-loop-cost", cl::init(20), cl::Hidden,
cl::desc(		cl::desc(
"The cost of a loop that is considered 'small' by the interleaver."));		"The cost of a loop that is considered 'small' by the interleaver."));

static cl::opt<bool> LoopVectorizeWithBlockFrequency(		static cl::opt<bool> LoopVectorizeWithBlockFrequency(
"loop-vectorize-with-block-frequency", cl::init(false), cl::Hidden,		"loop-vectorize-with-block-frequency", cl::init(true), cl::Hidden,
cl::desc("Enable the use of the block frequency analysis to access PGO "		cl::desc("Enable the use of the block frequency analysis to access PGO "
"heuristics minimizing code growth in cold regions and being more "		"heuristics minimizing code growth in cold regions and being more "
"aggressive in hot regions."));		"aggressive in hot regions."));

// Runtime interleave loops for load/store throughput.		// Runtime interleave loops for load/store throughput.
static cl::opt<bool> EnableLoadStoreRuntimeInterleave(		static cl::opt<bool> EnableLoadStoreRuntimeInterleave(
"enable-loadstore-runtime-interleave", cl::init(true), cl::Hidden,		"enable-loadstore-runtime-interleave", cl::init(true), cl::Hidden,
cl::desc(		cl::desc(
...
#endif /* NDEBUG */

// Check the function attributes to find out if this function should be		// Check the function attributes to find out if this function should be
// optimized for size.		// optimized for size.
bool OptForSize =		bool OptForSize =
Hints.getForce() != LoopVectorizeHints::FK_Enabled && F->optForSize();		Hints.getForce() != LoopVectorizeHints::FK_Enabled && F->optForSize();

// Check the loop for a trip count threshold: vectorize loops with a tiny trip		// Check the loop for a trip count threshold: vectorize loops with a tiny trip
// count by optimizing for size, to minimize overheads.		// count by optimizing for size, to minimize overheads.
unsigned ExpectedTC = SE->getSmallConstantMaxTripCount(L);		// Prefer constant trip counts over profile data, over upper bound estimate.
bool HasExpectedTC = (ExpectedTC > 0);		unsigned ExpectedTC = 0;
		bool HasExpectedTC = false;
		if (const SCEVConstant *ConstExits =
		dyn_cast<SCEVConstant>(SE->getBackedgeTakenCount(L))) {
		const APInt &ExitsCount = ConstExits->getAPInt();
		// We are interested in small values for ExpectedTC. Skip over those that
		bkramer: I'd prefer something like ConstExits->getValue()->getActiveBits() <= 32. getZExtValue is not safe for values larger than 64 bits, which are rare but can happen in theory.
		// can't fit an unsigned.
		if (ExitsCount.getActiveBits() <= 32 &&
		ExitsCount.getZExtValue() < std::numeric_limits<unsigned>::max()) {
		bkramer: This is now redundant ;)
		mtrofin (author): Why? getZExtValue() could be 0xffff ffff (getActiveBits() == 32). That would overflow on line 8358.
		bkramer: I see. Off by one. I should've suggested ExitsCount.ult(std::numeric_limits<unsigned>::max()) in that case, which should do the same as your current condition.
		mtrofin (author): Ah - thanks, that looks better, indeed!
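
The off-by-one in the exchange above can be sketched without LLVM types. This is a minimal model, assuming a 64-bit integer stands in for the APInt and that `unsigned` is 32 bits wide; `tripCountFromBackedgeCount` is a hypothetical helper, not a function from the patch.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of the patch): convert a constant
// backedge-taken count into a trip count by adding one, but only when the
// result still fits in 32 bits.
bool tripCountFromBackedgeCount(uint64_t BackedgeCount, uint32_t &TripCount) {
  const uint64_t Max32 = std::numeric_limits<uint32_t>::max();
  // The comparison must be strict. A count of exactly 0xFFFFFFFF has 32
  // active bits, so a "fits in 32 bits" check alone accepts it, yet adding
  // one to it wraps around to 0.
  if (BackedgeCount < Max32) {
    TripCount = static_cast<uint32_t>(BackedgeCount) + 1;
    return true;
  }
  return false;
}
```

A single strict unsigned comparison against the maximum covers both the "too wide" and the "would wrap" cases, which is presumably why the `ult` form was preferred over the two-part condition.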
		ExpectedTC = static_cast<unsigned>(ExitsCount.getZExtValue()) + 1;
		HasExpectedTC = true;
		}
		}
		// ExpectedTC may be large because it's bound by a variable. Check
		// profiling information to validate we should vectorize.
if (!HasExpectedTC && LoopVectorizeWithBlockFrequency) {		if (!HasExpectedTC && LoopVectorizeWithBlockFrequency) {
		davidxl: What is the ExpectedTC returned in this case? Why does it not return CouldNotCompute (or 0)?
		mtrofin (author): For example, for `for (unsigned i = 1; i < something; ++i) { ... }` it returns 0xffff fffe, assuming `unsigned something`. That's correct: it is the maximum number of times the loop body would be executed, because `something` might at most be 0xffff ffff. If the step were 2 instead of 1, the ExpectedTC would reflect that accordingly (meaning, the trip count would be half). In contrast, CouldNotCompute is produced in cases like unsupported loops, for instance those with multiple exits.
		When it comes to '0', the API behaves a bit unexpectedly, I'd argue: in cases like the ones in the current set of unit tests, it ends up adding 1 to the max taken back edges (which is 0xffff ffff there), which through overflow gets us to 0. That was the reason those tests were passing.
		I'd love to hear others' thoughts on this overflow behavior, and whether there's any reason we shouldn't refactor it, because relying on the overflow to represent "not having an expected trip count" breaks down in cases like the one I just presented (e.g. non-unitary step, starting from something other than 0, etc.).
		This issue of determining "no expected trip counts" aside, I'd argue the issue addressed here can be addressed separately.
		davidxl: I think the logic here should implement the order of precedence in this way:
		1) If there is a known constant trip count (such as for (...; i < 1000; i++)), use that as the expectedTC.
		2) Otherwise, if there is a profile-based estimated trip count, use it as the expectedTC.
		3) Otherwise, use the smallConstantMaxTripCount as an estimation.
		mtrofin (author): That would make it more clear, indeed.
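
The wrap-around described in this thread can be illustrated with a toy model. The sketch below is not the ScalarEvolution implementation; it only assumes, as the comment above states, that the maximum trip count is formed by adding 1 to the maximum backedge-taken count in 32-bit arithmetic. Both function names are hypothetical.

```cpp
#include <cstdint>

// Toy model (an assumption, not LLVM code): the maximum trip count is the
// maximum backedge-taken count plus one, computed in 32 bits.
uint32_t modelMaxTripCount(uint32_t MaxBackedgeTaken) {
  // Unsigned arithmetic wraps modulo 2^32, so 0xFFFFFFFF + 1 yields 0.
  return static_cast<uint32_t>(MaxBackedgeTaken + 1u);
}

// The pre-patch check from the diff: any nonzero value counts as "known".
bool hasExpectedTC(uint32_t ExpectedTC) { return ExpectedTC > 0; }
```

Under this model, an upper bound of 0xFFFFFFFF backedges happens to wrap to 0 and reads as "no expected trip count", while a slightly smaller bound such as 0xFFFFFFFD yields a huge but nonzero value that reads as "known". That asymmetry is the reason the thread argues against relying on the overflow as a sentinel.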
auto EstimatedTC = getLoopEstimatedTripCount(L);		auto EstimatedTC = getLoopEstimatedTripCount(L);
if (EstimatedTC) {		if (EstimatedTC) {
ExpectedTC = *EstimatedTC;		ExpectedTC = *EstimatedTC;
HasExpectedTC = true;		HasExpectedTC = true;
}		}
}		}
		if (!HasExpectedTC) {
		ExpectedTC = SE->getSmallConstantMaxTripCount(L);
		HasExpectedTC = (ExpectedTC > 0);
		}
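
The three-tier selection that the right-hand side of the diff implements (constant trip count, then profile estimate, then upper-bound estimate) can be sketched as a standalone function. This is a simplified model under stated assumptions: `LoopFacts`, `ExpectedTripCount`, and `selectExpectedTC` are hypothetical names, and the struct fields stand in for the results of the SCEV and profile queries in the patch.

```cpp
#include <cstdint>

// Hypothetical inputs standing in for the analyses queried by the patch.
struct LoopFacts {
  bool HasConstBackedgeCount;         // backedge-taken count is a constant
  uint64_t ConstBackedgeCount;        // its value, if so
  bool HasProfileEstimate;            // profile data gave an estimate
  uint32_t ProfileEstimate;           // the estimated trip count
  uint32_t SmallConstantMaxTripCount; // upper bound; 0 means "unknown"
};

struct ExpectedTripCount {
  bool Known;
  uint32_t Value;
};

ExpectedTripCount selectExpectedTC(const LoopFacts &F, bool UseProfile) {
  ExpectedTripCount R = {false, 0};
  // 1) Prefer a constant trip count when the +1 cannot wrap.
  if (F.HasConstBackedgeCount && F.ConstBackedgeCount < 0xFFFFFFFFull) {
    R.Value = static_cast<uint32_t>(F.ConstBackedgeCount) + 1;
    R.Known = true;
  }
  // 2) Otherwise use the profile-based estimate, if enabled and present.
  if (!R.Known && UseProfile && F.HasProfileEstimate) {
    R.Value = F.ProfileEstimate;
    R.Known = true;
  }
  // 3) Otherwise fall back to the upper-bound estimate.
  if (!R.Known) {
    R.Value = F.SmallConstantMaxTripCount;
    R.Known = R.Value > 0;
  }
  return R;
}
```

With a constant backedge count of 1000 and a profile estimate of 2, this model returns 1001, which mirrors the intent of the `const_trip_over_profile` test: the constant wins over the profile data.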

if (HasExpectedTC && ExpectedTC < TinyTripCountVectorThreshold) {		if (HasExpectedTC && ExpectedTC < TinyTripCountVectorThreshold) {
DEBUG(dbgs() << "LV: Found a loop with a very small trip count. "		DEBUG(dbgs() << "LV: Found a loop with a very small trip count. "
<< "This loop is worth vectorizing only if no scalar "		<< "This loop is worth vectorizing only if no scalar "
<< "iteration overheads are incurred.");		<< "iteration overheads are incurred.");
if (Hints.getForce() == LoopVectorizeHints::FK_Enabled)		if (Hints.getForce() == LoopVectorizeHints::FK_Enabled)
DEBUG(dbgs() << " But vectorizing was explicitly forced.\n");		DEBUG(dbgs() << " But vectorizing was explicitly forced.\n");
else {		else {
...

test/Transforms/LoopVectorize/tripcount.ll

...
for.body: ; preds = %for.body, %entry
%exitcond = icmp eq i32 %i.08, %bound		%exitcond = icmp eq i32 %i.08, %bound
br i1 %exitcond, label %for.end, label %for.body, !prof !1		br i1 %exitcond, label %for.end, label %for.body, !prof !1

for.end: ; preds = %for.body		for.end: ; preds = %for.body
ret i32 0		ret i32 0
}		}

define i32 @foo_low_trip_count3(i1 %cond, i32 %bound) !prof !0 {		define i32 @foo_low_trip_count3(i1 %cond, i32 %bound) !prof !0 {
; The loop has low invocation count compare to the function invocation count,		; The loop has low invocation count compare to the function invocation count,
; but has a high trip count per invocation. Vectorize it.		; but has a high trip count per invocation. Vectorize it.

; CHECK-LABEL: @foo_low_trip_count3(		; CHECK-LABEL: @foo_low_trip_count3(
; CHECK: vector.body:		; CHECK: vector.body:

entry:		entry:
br i1 %cond, label %for.preheader, label %for.end, !prof !2		br i1 %cond, label %for.preheader, label %for.end, !prof !2

...
for.body: ; preds = %for.body, %entry
%inc = add nsw i32 %i.08, 1		%inc = add nsw i32 %i.08, 1
%exitcond = icmp eq i32 %i.08, %bound		%exitcond = icmp eq i32 %i.08, %bound
br i1 %exitcond, label %for.end, label %for.body, !prof !3		br i1 %exitcond, label %for.end, label %for.body, !prof !3

for.end: ; preds = %for.body		for.end: ; preds = %for.body
ret i32 0		ret i32 0
}		}

		define i32 @foo_low_trip_count_icmp_sgt(i32 %bound) {
		; Simple loop with low tripcount and inequality test for exit.
		; Should not be vectorized.

		; CHECK-LABEL: @foo_low_trip_count_icmp_sgt(
		; CHECK-NOT: <{{[0-9]+}} x i8>

		entry:
		br label %for.body

		for.body: ; preds = %for.body, %entry
		%i.08 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
		%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @tab, i32 0, i32 %i.08
		%0 = load i8, i8* %arrayidx, align 1
		%cmp1 = icmp eq i8 %0, 0
		%. = select i1 %cmp1, i8 2, i8 1
		store i8 %., i8* %arrayidx, align 1
		%inc = add nsw i32 %i.08, 1
		%exitcond = icmp sgt i32 %i.08, %bound
		br i1 %exitcond, label %for.end, label %for.body, !prof !1

		for.end: ; preds = %for.body
		ret i32 0
		}

		define i32 @const_low_trip_count() {
		; Simple loop with constant, small trip count and no profiling info.

		; CHECK-LABEL: @const_low_trip_count
		; CHECK-NOT: <{{[0-9]+}} x i8>

		entry:
		br label %for.body

		for.body: ; preds = %for.body, %entry
		%i.08 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
		%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @tab, i32 0, i32 %i.08
		%0 = load i8, i8* %arrayidx, align 1
		%cmp1 = icmp eq i8 %0, 0
		%. = select i1 %cmp1, i8 2, i8 1
		store i8 %., i8* %arrayidx, align 1
		%inc = add nsw i32 %i.08, 1
		%exitcond = icmp slt i32 %i.08, 2
		br i1 %exitcond, label %for.body, label %for.end

		for.end: ; preds = %for.body
		ret i32 0
		}

		define i32 @const_large_trip_count() {
		; Simple loop with constant large trip count and no profiling info.

		; CHECK-LABEL: @const_large_trip_count
		; CHECK: <{{[0-9]+}} x i8>

		entry:
		br label %for.body

		for.body: ; preds = %for.body, %entry
		%i.08 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
		%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @tab, i32 0, i32 %i.08
		%0 = load i8, i8* %arrayidx, align 1
		%cmp1 = icmp eq i8 %0, 0
		%. = select i1 %cmp1, i8 2, i8 1
		store i8 %., i8* %arrayidx, align 1
		%inc = add nsw i32 %i.08, 1
		%exitcond = icmp slt i32 %i.08, 1000
		br i1 %exitcond, label %for.body, label %for.end

		for.end: ; preds = %for.body
		ret i32 0
		}

		define i32 @const_small_trip_count_step() {
		; Simple loop with static, small trip count and no profiling info.

		; CHECK-LABEL: @const_small_trip_count_step
		; CHECK-NOT: <{{[0-9]+}} x i8>

		entry:
		br label %for.body

		for.body: ; preds = %for.body, %entry
		%i.08 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
		%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @tab, i32 0, i32 %i.08
		%0 = load i8, i8* %arrayidx, align 1
		%cmp1 = icmp eq i8 %0, 0
		%. = select i1 %cmp1, i8 2, i8 1
		store i8 %., i8* %arrayidx, align 1
		%inc = add nsw i32 %i.08, 5
		%exitcond = icmp slt i32 %i.08, 10
		br i1 %exitcond, label %for.body, label %for.end

		for.end: ; preds = %for.body
		ret i32 0
		}

		define i32 @const_trip_over_profile() {
		; constant trip count takes precedence over profile data

		; CHECK-LABEL: @const_trip_over_profile
		; CHECK: <{{[0-9]+}} x i8>

		entry:
		br label %for.body

		for.body: ; preds = %for.body, %entry
		%i.08 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
		%arrayidx = getelementptr inbounds [32 x i8], [32 x i8]* @tab, i32 0, i32 %i.08
		%0 = load i8, i8* %arrayidx, align 1
		%cmp1 = icmp eq i8 %0, 0
		%. = select i1 %cmp1, i8 2, i8 1
		store i8 %., i8* %arrayidx, align 1
		%inc = add nsw i32 %i.08, 1
		%exitcond = icmp slt i32 %i.08, 1000
		br i1 %exitcond, label %for.body, label %for.end, !prof !1

		for.end: ; preds = %for.body
		ret i32 0
		}

!0 = !{!"function_entry_count", i64 100}		!0 = !{!"function_entry_count", i64 100}
!1 = !{!"branch_weights", i32 100, i32 0}		!1 = !{!"branch_weights", i32 100, i32 0}
!2 = !{!"branch_weights", i32 10, i32 90}		!2 = !{!"branch_weights", i32 10, i32 90}
!3 = !{!"branch_weights", i32 10, i32 10000}		!3 = !{!"branch_weights", i32 10, i32 10000}

This is an archive of the discontinued LLVM Phabricator instance.

Verify profile data confirms large loop trip counts.
ClosedPublic
