Improve profile-guided heuristics to use estimated trip count.
Needs Review · Public

Authored by twoh on Mon, Apr 24, 1:50 PM.

Details

Summary

The existing heuristic uses the ratio between the function entry
frequency and the loop invocation frequency to find cold loops. However,
even if a loop executes frequently, vectorization is not beneficial if
it has a small trip count per invocation. Conversely, even if the loop
invocation frequency is much smaller than the function invocation
frequency, it is still beneficial to vectorize the loop if the trip
count is high.

This patch uses the estimated trip count computed from profile metadata
as the primary metric to determine the coldness of a loop. If the
estimated trip count cannot be computed, it falls back to the original
heuristics.
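For context, the profile-based estimate comes from the branch weights on the loop latch. Below is a minimal sketch of the idea only, not the actual LLVM implementation; `estimateTripCount` is a hypothetical stand-in for `getLoopEstimatedTripCount`, whose exact rounding differs:

```cpp
#include <cstdint>
#include <optional>

// If the latch branch's profile says the backedge was taken B times for
// every E exits from the loop, each invocation ran about B/E backedge
// iterations before exiting, i.e. roughly B/E + 1 trips per invocation.
std::optional<unsigned> estimateTripCount(uint64_t BackedgeWeight,
                                          uint64_t ExitWeight) {
  if (ExitWeight == 0)
    return std::nullopt; // no usable profile: caller must fall back
  return (unsigned)(BackedgeWeight / ExitWeight) + 1;
}
```

When this returns no value, the patch falls back to the original entry-frequency heuristic.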

twoh created this revision. Mon, Apr 24, 1:50 PM

When I tried to play with this kind of thing, I got rather inconsistent performance results, because of things like a cold loop within a relatively hot loop.
Do you have any performance numbers for this?

twoh added a comment. Mon, Apr 24, 2:00 PM

@mkuper This helps with our internal benchmarks, but I don't have numbers with public benchmarks. What would be the good benchmarks to evaluate the vectorization? Thanks!

In D32451#735969, @twoh wrote:

@mkuper This helps with our internal benchmarks, but I don't have numbers with public benchmarks. What would be the good benchmarks to evaluate the vectorization? Thanks!

We don't really have a good public test-suite for this. SPECint 2006 is probably the lowest common denominator. :-\
As to your internal benchmarks, I think an anonymized table would be great, just to see what the overall effect is, e.g. something like:

benchmark 1: +3%
benchmark 2: -2%
benchmark 3: -1.5%
Everything else (~20 benchmarks): no effect.

We've done this before, for our own benchmarks. This doesn't provide a lot of information, but still gives some idea of the impact. Would that work for you?

twoh added a comment. Mon, Apr 24, 2:17 PM

@mkuper I can do that, but the issue is that the vectorized part doesn't dominate the execution time of the benchmark (the profile is pretty flat across multiple functions), so it may not be suitable for telling the impact of the vectorization change. Let me think about a good way to measure it. Thanks!

twoh updated this revision to Diff 97003. Thu, Apr 27, 2:35 PM

Use profile-based trip count estimation only when the trip count cannot be computed statically.

twoh added a comment. Thu, Apr 27, 2:36 PM

So I evaluated the loop vectorizer with https://github.com/malvanos/Video-SIMDBench, which is introduced in this paper: http://ieeexplore.ieee.org/document/7723550/. I first built it without profile data and ran Linux perf with the command

perf record -e cpu/event=0xc4,umask=0x20,name=br_inst_retired_near_taken,period=400009/ -b ./bench

The perf data was then processed with the autofdo tool (https://github.com/google/autofdo) and fed to the subsequent compilations for the evaluation.

There are 220 benchmarks in the set, and 18 of them are affected by this patch. These 18 benchmarks reduce to 6 functions (for example, all benchmarks named mc_chroma_?x? call mc_chroma with different parameters). The 6 functions are quant_4x4, quant_4x4x4, vbench_plane_copy_deinterleave_rgb_c, vbench_plane_copy_interleave_c, mc_chroma, and mc_luma. 3 of the 6 functions are only vectorized with the existing heuristic, while the other 3 are only vectorized with the new heuristic. Let's go through them in detail.


  • quant_4x4x4 and quant_4x4

These functions were only vectorized with the original heuristic, because the profile data says that the target loop's estimated trip count is 1. However, the target loops in these functions actually have a fixed trip count, which can be computed with the ScalarEvolution::getSmallConstantTripCount function. So once I fixed the heuristic to use the profile-guided estimated trip count only when ScalarEvolution::getSmallConstantTripCount fails to compute the actual trip count, both the original and the new heuristic generate the same vectorized code.

  • vbench_plane_copy_deinterleave_rgb_c and vbench_plane_copy_interleave_c

These functions are only vectorized with the new heuristic. For both of them, the estimated loop entry count is less than 20% of the estimated function entry count. This is odd, because the source code shows that the loop is supposed to be invoked whenever the enclosing function is invoked.

Comparing performance, vbench_plane_copy_deinterleave_rgb_c is 6.2 percent slower with the new heuristic, meaning vectorization hurts performance, while vbench_plane_copy_interleave_c is 5 times (not percent!) faster with the new heuristic, meaning vectorization is highly beneficial.

| Function | Benchmark | Cycles (Original) | Cycles (New) | Difference (%) |
| vbench_plane_copy_deinterleave_rgb_c | plane_copy_deinterleave_rgb | 11358 | 12064 | +6.2% |
| vbench_plane_copy_interleave_c | plane_copy_interleave | 14616 | 2815 | -80.7% |

This seems counter-intuitive considering that the actual operations performed in these functions are practically identical (each is just a copy of array elements). However, the actual trip count of the target loop can explain the difference. The trip count histogram of vbench_plane_copy_deinterleave_rgb_c across multiple invocations is

{trip count: occurrence} = {1:13056, 2:13056, 5:13056, 8:13056, 64:26112, 66:13056, 126:13056, 132:13056, 476:13056}

while for vbench_plane_copy_interleave_c it is

{trip count: occurrence} = {1:32256, 4:32256, 10:32256, 16:6528, 128:13056, 132:32256, 252:32256, 264:32256, 952:32256}

So for vbench_plane_copy_deinterleave_rgb_c, low-trip-count invocations offset the benefit that vectorization brings to high-trip-count invocations, whereas for vbench_plane_copy_interleave_c, high-trip-count invocations dominate the execution time, so the benefit of vectorization is clearly observed.

  • mc_chroma and mc_luma

The evaluation of these two functions clearly shows the correlation between the trip count and the effect of vectorization. There are 7 benchmarks associated with each function, with different parameters. mc_chroma is only vectorized with the new heuristic, while mc_luma is only vectorized with the original heuristic. Below is the performance summary:

| Function | Benchmark | Cycles (Original) | Cycles (New) | Difference (%) |
| mc_chroma | mc_chroma_2x2 | 530 | 586 | +10.6% |
| | mc_chroma_2x4 | 840 | 921 | +9.6% |
| | mc_chroma_4x2 | 816 | 861 | +5.5% |
| | mc_chroma_4x4 | 1415 | 1465 | +3.5% |
| | mc_chroma_4x8 | 2610 | 2669 | +2.3% |
| | mc_chroma_8x4 | 2617 | 2676 | +2.3% |
| | mc_chroma_8x8 | 5016 | 5103 | +1.7% |
| mc_luma | mc_luma_4X4 | 552 | 489 | -11.4% |
| | mc_luma_4X8 | 892 | 818 | -8.3% |
| | mc_luma_8X4 | 753 | 730 | -3.1% |
| | mc_luma_8X8 | 1301 | 1300 | -0.1% |
| | mc_luma_8X16 | 2488 | 2437 | -2.0% |
| | mc_luma_16X8 | 2264 | 2487 | +9.8% |
| | mc_luma_16X16 | 4219 | 4701 | +11.4% |

As the table shows, for low-trip-count invocations (roughly below 8x8), the non-vectorized code (original for mc_chroma, new for mc_luma) performs better, but for high-trip-count invocations, the vectorized code (new for mc_chroma, original for mc_luma) performs better. Here, again, I suspect that the profile numbers might be wrong: the profile estimates the trip count for mc_chroma as 153 but the trip count for mc_luma as 3, which results in different vectorization decisions with the new heuristic. (LoopEntryCount/ColdEntryCount was 8/20 for mc_chroma and 270404/23906 for mc_luma, which affects the original heuristic's vectorization decision.) I guess the profile numbers might be getting messed up during the loop transformations, but I don't have any evidence for that yet.


So given the evaluation results, I think the trip count is a better metric than the invocation count for estimating the effectiveness of vectorization. Also, I think we need to be more conservative about loop-vectorizing low-trip-count loops, and we need to improve the precision of the profile data across the optimization passes.

twoh added a comment. Fri, May 5, 8:46 AM

Ping. Thanks!

Ayal added a comment. Fri, May 5, 4:45 PM

Thanks Taewook for sharing the experimental results. What target was this run on?

lib/Transforms/Vectorize/LoopVectorize.cpp
7728

The original comment should in any case be updated to indicate that it affects the decision to optimize cold loops for size.

7728

If a loop is known to iterate fewer than TinyTripCountVectorThreshold times, we avoid vectorizing it altogether rather than vectorizing it under code-size constraints, unless vectorization is explicitly forced.

So should this

const unsigned MaxTC = SE->getSmallConstantMaxTripCount(L);
if (MaxTC > 0u && MaxTC < TinyTripCountVectorThreshold) {
  ...
}

be extended to use profiling where static analysis fails, e.g., by inserting the following between the top two lines above:

if (MaxTC == 0 && LoopVectorizeWithBlockFrequency) {
  auto EstimatedTC = getLoopEstimatedTripCount(L);
  if (EstimatedTC)
    MaxTC = *EstimatedTC;
}

?

OTOH, setting OptForSize to true when the trip count is unknown effectively prevents vectorization, because an epilog is needed.

test/Transforms/LoopVectorize/tripcount.ll
3

(As argued above, we expect loop not to be vectorized, rather than optimized for size.)

16

Better check instead that no vector types are generated, regardless of their size.

twoh updated this revision to Diff 98126. Sun, May 7, 9:49 PM

Addressing comments from Ayal.

twoh added a comment (edited). Sun, May 7, 9:59 PM

Thanks @Ayal for your comments! If the profile-based trip count check is moved above the line

if (MaxTC > 0u && MaxTC < TinyTripCountVectorThreshold)

it wouldn't be possible to distinguish the case where static analysis fails to compute MaxTC from the case where the profile-based trip count is actually 0. Also, as profile-based numbers are not as definitive as numbers from static analysis, I think it might be worth just optimizing for size rather than disabling vectorization. As you mentioned in your comment, OptForSize is effectively the same as disabling vectorization for now, but the algorithm for the OptForSize case might change in the future.

I updated the test per your suggestion. And the target was x86-64 (sandybridge).

Ayal added a comment. Thu, May 11, 1:25 PM

I'm inclined to treat TinyTripCount loops with associated reports the same for static and profile-based TripCounts, but am ok with setting OptForSize if profile-based-TripCount < TinyTripCount while aborting & reporting when static-TripCount < TinyTripCount, as suggested. The outcome is practically the same (see below). @mssimpso, @mkuper - do you have a preference here?

More comments:

In D32451#748358, @twoh wrote:

Thanks @Ayal for your comments! If the profile-based trip count check is moved above the line

if (MaxTC > 0u && MaxTC < TinyTripCountVectorThreshold)

it wouldn't be possible to distinguish the case where static analysis fails to compute MaxTC from the case where the profile-based trip count is actually 0. Also, as profile-based numbers are not as definitive as numbers from static analysis, I think it might be worth just optimizing for size rather than disabling vectorization. As you mentioned in your comment, OptForSize is effectively the same as disabling vectorization for now, but the algorithm for the OptForSize case might change in the future.

Having EstimatedTC < TinyTripCountVectorThreshold should not imply "IsColdLoop". The loop may be hot.

Yes, getSmallConstantMaxTripCount() should also return an Optional<unsigned> (but not in this commit).

Can alternatively do

unsigned ExpectedTC = SE->getSmallConstantMaxTripCount(L);
bool HasExpectedTC = (ExpectedTC > 0);

if (!HasExpectedTC && LoopVectorizeWithBlockFrequency) {
  auto EstimatedTC = getLoopEstimatedTripCount(L);
  if (EstimatedTC) {
    ExpectedTC = *EstimatedTC;
    HasExpectedTC = true;
  }
}

if (HasExpectedTC && ExpectedTC < TinyTripCountVectorThreshold) {
  ...
}

OTOH, setting OptForSize to true when the trip count is unknown effectively prevents vectorization, because an epilog is needed.

Just for completeness, if the trip count is unknown but known to be divisible by VF, the loop could potentially be vectorized without an epilog.
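To make the divisibility point concrete: the scalar epilog has to cover the iterations left over after the vector loop, which is simply the trip count modulo the vectorization factor. A trivial sketch (hypothetical helper, ignoring runtime checks that may independently force a scalar tail):

```cpp
// Iterations left for a scalar remainder loop after the vector loop has
// executed floor(TripCount / VF) full vector iterations. A result of 0
// means the vector loop covers everything and no epilog is needed.
unsigned epilogIterations(unsigned TripCount, unsigned VF) {
  return TripCount % VF;
}
```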

Ayal added inline comments. Thu, May 11, 2:00 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
7729

When is the LoopEntryFreq < ColdEntryFreq criteria expected to kick in - when the loop latch has no associated frequencies (!EstimatedTC) but the function entry and loop preheader do? Looks like this criteria is effectively disabled, right?

twoh added a comment. Thu, May 11, 10:05 PM

I don't have a strong opinion about whether to treat the static trip count and the profile-based trip count the same. I treated them separately because the existing implementation sets OptForSize to true instead of returning false when it makes a profile-based decision. It would be great if someone could explain the reason behind that.

@Ayal Yes, the 'LoopEntryFreq < ColdEntryFreq' criteria kicks in if the loop latch has no associated frequencies but the function entry and loop preheader do. I actually observed such a case in some of our internal code, because loop transformations mess up the branch profile info. However, although I kept the criteria, I'm not sure about its effectiveness, as I wrote in the summary. Maybe we keep the criteria but change it to 'LoopEntryFreq == 0' to disable vectorization only for obviously cold loops.

I'll wait for other opinions, and if there's no input, will update the code per Ayal's suggestions. Thanks!

Ayal added a comment. Sat, May 13, 2:31 PM
In D32451#752865, @twoh wrote:

I don't have a strong opinion about whether to treat the static trip count and the profile-based trip count the same. I treated them separately because the existing implementation sets OptForSize to true instead of returning false when it makes a profile-based decision. It would be great if someone could explain the reason behind that.

A similar OptForSize constraint appears in MachineBlockPlacement:

// If the block is cold relative to the function entry don't waste space
// aligning it.

You may want to take a look at revision 200294 and discussion thread http://lists.llvm.org/pipermail/llvm-dev/2014-February/069932.html

@Ayal Yes, the 'LoopEntryFreq < ColdEntryFreq' criteria kicks in if the loop latch has no associated frequencies but the function entry and loop preheader do. I actually observed such a case in some of our internal code, because loop transformations mess up the branch profile info. However, although I kept the criteria, I'm not sure about its effectiveness, as I wrote in the summary. Maybe we keep the criteria but change it to 'LoopEntryFreq == 0' to disable vectorization only for obviously cold loops.

It would probably be best to fix the messed-up branch-profile info. This also relates to your observations analyzing Video-SIMDBench. Could you provide reproducer(s)?

twoh added a comment. Sat, May 13, 10:38 PM

@Ayal Thanks for the pointer.

It would probably be best to fix the messed-up branch-profile info. This also relates to your observations analyzing Video-SIMDBench. Could you provide reproducer(s)?

Sorry, but I've lost track of what the case was. But I'm planning to take a look at branch profile info propagation with Video-SIMDBench and others to improve the precision.

twoh updated this revision to Diff 99616. Fri, May 19, 1:20 PM

Addressing Ayal's comments.

Ayal added a comment. Sat, May 20, 6:40 AM

Thanks @twoh - this looks fine to me, and is pending @mkuper's approval following his original concern:

When I tried to play with this kind of thing, I got rather inconsistent performance results, because of things like a cold loop within a relatively hot loop.
Do you have any performance numbers for this?

Having lifted the OptForSize constraint when vectorizing cold loops, do you see any increase in compile-time / code-size when compiling with profile information?

A follow-up thought raised by this patch, which deserves a separate discussion/patch: should loops whose trip-count is smaller than TinyTripCountVectorThreshold, statically or by profile, be vectorized under OptForSize constraint rather than non-vectorized? The OptForSize constraint implies little if any overheads outside of the vectorized loop body, so the current cost estimate of the vectorized-vs-scalar loop body should hopefully be more/sufficiently accurate.

twoh added a comment. Mon, May 22, 3:33 PM

Comparing the existing implementation and this patch, I don't observe a noticeable compile time difference with Video-SIMDBench. I compiled the benchmark suite 3 times each, and the median was 27.12 sec vs 27.55 sec, while the average was 27.96 sec vs 27.89 sec. There were some code size differences, but that is simply because more vectorization results in bigger code size. There's no difference between OptForSize and non-vectorized.

I observe that branch frequency metadata handling has improved since I first submitted this diff, which makes a difference for some benchmarks. For example, mc_chroma, which was not vectorized with the existing implementation but was vectorized with this patch, is now vectorized with the original implementation but not with this patch. However, the branch frequency metadata is still not perfect, and I was able to find 4 loops in mc_luma whose loop entry frequency information is available but whose estimated trip count is not. These loops are vectorized if we make the vectorization decision based only on the trip count, but not if we fall back to the loop entry frequency when the trip count is unavailable, because their entry frequency is smaller than the threshold. Also, there are loops whose trip counts are underestimated and which therefore miss the vectorization opportunity. By chance, these loops perform better with the existing implementation because their loop entry frequency is higher than the cold entry frequency.

In summary, I don't see much difference between OptForSize and non-vectorized, but I do see the potential for better vectorization decisions from this patch given more precise profile info.