Improve profile-guided heuristics to use estimated trip count.
Needs Review · Public

Authored by twoh on Mon, Apr 24, 1:50 PM.

Details

Summary

The existing heuristic uses the ratio between the function entry
frequency and the loop invocation frequency to find cold loops. However,
even if a loop executes frequently, vectorization is not beneficial if
it has a small trip count per invocation. Conversely, even if the loop
invocation frequency is much smaller than the function invocation
frequency, it is still beneficial to vectorize the loop if the trip
count is high.

This patch uses the estimated trip count computed from profile metadata
as the primary metric to determine the coldness of a loop. If the
estimated trip count cannot be computed, it falls back to the original
heuristics.
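For context, the profile-based estimate comes from the branch weights on the loop latch. Below is a minimal sketch of the idea only, not the actual LLVM implementation; `estimateTripCount` is a hypothetical stand-in for `getLoopEstimatedTripCount`, whose exact rounding differs:

```cpp
#include <cstdint>
#include <optional>

// If the latch branch's profile says the backedge was taken B times for
// every E exits from the loop, each invocation ran about B/E backedge
// iterations before exiting, i.e. roughly B/E + 1 trips per invocation.
std::optional<unsigned> estimateTripCount(uint64_t BackedgeWeight,
                                          uint64_t ExitWeight) {
  if (ExitWeight == 0)
    return std::nullopt; // no usable profile: caller must fall back
  return (unsigned)(BackedgeWeight / ExitWeight) + 1;
}
```

When this returns no value, the patch falls back to the original entry-frequency heuristic.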

twoh created this revision. Mon, Apr 24, 1:50 PM

When I tried to play with this kind of thing, I got rather inconsistent performance results, because of things like a cold loop within a relatively hot loop.
Do you have any performance numbers for this?

twoh added a comment. Mon, Apr 24, 2:00 PM

@mkuper This helps with our internal benchmarks, but I don't have numbers with public benchmarks. What would be the good benchmarks to evaluate the vectorization? Thanks!

In D32451#735969, @twoh wrote:

@mkuper This helps with our internal benchmarks, but I don't have numbers with public benchmarks. What would be the good benchmarks to evaluate the vectorization? Thanks!

We don't really have a good public test-suite for this. SPECint 2006 is probably the lowest common denominator. :-\
As to your internal benchmarks, I think an anonymized table would be great, just to see what the overall effect is, e.g. something like:

benchmark 1: +3%
benchmark 2: -2%
benchmark 3: -1.5%
Everything else (~20 benchmarks): no effect.

We've done this before, for our own benchmarks. This doesn't provide a lot of information, but still gives some idea of the impact. Would that work for you?

twoh added a comment. Mon, Apr 24, 2:17 PM

@mkuper I can do that, but the issue is that the vectorized part doesn't dominate the execution time of the benchmark (the profile is pretty flat across multiple functions), so it may not be suitable for telling the impact of the vectorization change. Let me think about a good way to measure it. Thanks!

twoh updated this revision to Diff 97003. Thu, Apr 27, 2:35 PM

Use profile-based trip count estimation only when the trip count cannot be computed statically.

twoh added a comment. Thu, Apr 27, 2:36 PM

So I evaluated the loop vectorizer with https://github.com/malvanos/Video-SIMDBench, which is introduced in this paper: http://ieeexplore.ieee.org/document/7723550/. I first built it without profile data and ran Linux perf with the command

perf record -e cpu/event=0xc4,umask=0x20,name=br_inst_retired_near_taken,period=400009/ -b ./bench

The perf data was then processed with the autofdo tool (https://github.com/google/autofdo) and fed to the subsequent compilations for the evaluation.

There are 220 benchmarks in the set, and 18 of them are affected by this patch. These 18 benchmarks reduce to 6 functions (for example, all benchmarks named mc_chroma_?x? call mc_chroma with different parameters). The 6 functions are quant_4x4, quant_4x4x4, vbench_plane_copy_deinterleave_rgb_c, vbench_plane_copy_interleave_c, mc_chroma, and mc_luma. 3 of the 6 functions are only vectorized with the existing heuristic, while the other 3 are only vectorized with the new heuristic. Let's go through them in detail.


  • quant_4x4x4 and quant_4x4

These functions were only vectorized with the original heuristic, because the profile data says that the target loop's estimated trip count is 1. However, the target loops in these functions actually have a fixed trip count, which can be computed with the ScalarEvolution::getSmallConstantTripCount function. So once I fixed the heuristic to use the profile-guided estimated trip count only when ScalarEvolution::getSmallConstantTripCount fails to compute the actual trip count, both the original and the new heuristic generate the same vectorized code.

  • vbench_plane_copy_deinterleave_rgb_c and vbench_plane_copy_interleave_c

These functions are only vectorized with the new heuristic. For both of them, the estimated loop entry count is less than 20% of the estimated function entry count. This is odd, because the source code shows that the loop is supposed to be invoked whenever the enclosing function is invoked.

Comparing performance, vbench_plane_copy_deinterleave_rgb_c is 6.2 percent slower with the new heuristic, meaning vectorization hurts performance, while vbench_plane_copy_interleave_c is 5 times (not percent!) faster with the new heuristic, meaning vectorization is highly beneficial.

| Function | Benchmark | Cycles (Original) | Cycles (New) | Difference (%) |
| vbench_plane_copy_deinterleave_rgb_c | plane_copy_deinterleave_rgb | 11358 | 12064 | +6.2% |
| vbench_plane_copy_interleave_c | plane_copy_interleave | 14616 | 2815 | -80.7% |

This seems counter-intuitive considering that the actual operations performed in these functions are practically identical (each is just a copy of array elements). However, the actual trip count of the target loop can explain the difference. The trip count histogram of vbench_plane_copy_deinterleave_rgb_c across multiple invocations is

{trip count: occurrence} = {1:13056, 2:13056, 5:13056, 8:13056, 64:26112, 66:13056, 126:13056, 132:13056, 476:13056}

while for vbench_plane_copy_interleave_c it is

{trip count: occurrence} = {1:32256, 4:32256, 10:32256, 16:6528, 128:13056, 132:32256, 252:32256, 264:32256, 952:32256}

So for vbench_plane_copy_deinterleave_rgb_c, low-trip-count invocations offset the benefit that vectorization brings to high-trip-count invocations, whereas for vbench_plane_copy_interleave_c, high-trip-count invocations dominate the execution time, so the benefit of vectorization is clearly observed.

  • mc_chroma and mc_luma

The evaluation of these two functions clearly shows the correlation between the trip count and the effect of vectorization. There are 7 benchmarks associated with each function, with different parameters. mc_chroma is only vectorized with the new heuristic, while mc_luma is only vectorized with the original heuristic. Below is the performance summary:

| Function | Benchmark | Cycles (Original) | Cycles (New) | Difference (%) |
| mc_chroma | mc_chroma_2x2 | 530 | 586 | +10.6% |
| | mc_chroma_2x4 | 840 | 921 | +9.6% |
| | mc_chroma_4x2 | 816 | 861 | +5.5% |
| | mc_chroma_4x4 | 1415 | 1465 | +3.5% |
| | mc_chroma_4x8 | 2610 | 2669 | +2.3% |
| | mc_chroma_8x4 | 2617 | 2676 | +2.3% |
| | mc_chroma_8x8 | 5016 | 5103 | +1.7% |
| mc_luma | mc_luma_4X4 | 552 | 489 | -11.4% |
| | mc_luma_4X8 | 892 | 818 | -8.3% |
| | mc_luma_8X4 | 753 | 730 | -3.1% |
| | mc_luma_8X8 | 1301 | 1300 | -0.1% |
| | mc_luma_8X16 | 2488 | 2437 | -2.0% |
| | mc_luma_16X8 | 2264 | 2487 | +9.8% |
| | mc_luma_16X16 | 4219 | 4701 | +11.4% |

As the table shows, for low-trip-count invocations (roughly below 8x8), the non-vectorized code (original for mc_chroma, new for mc_luma) performs better, but for high-trip-count invocations, the vectorized code (new for mc_chroma, original for mc_luma) performs better. Here, again, I suspect that the profile numbers might be wrong: the profile estimates the trip count for mc_chroma as 153 but the trip count for mc_luma as 3, which results in different vectorization decisions with the new heuristic. (LoopEntryCount/ColdEntryCount was 8/20 for mc_chroma and 270404/23906 for mc_luma, which affects the original heuristic's vectorization decision.) I guess the profile numbers might be getting messed up during the loop transformations, but I don't have any evidence for that yet.


So given the evaluation results, I think the trip count is a better metric than the invocation count for estimating the effectiveness of vectorization. Also, I think we need to be more conservative about loop-vectorizing low-trip-count loops, and we need to improve the precision of the profile data across the optimization passes.

twoh added a comment. Fri, May 5, 8:46 AM

Ping. Thanks!

Ayal added a comment. Fri, May 5, 4:45 PM

Thanks Taewook for sharing the experimental results. What target was this run on?

lib/Transforms/Vectorize/LoopVectorize.cpp
7728

The original comment should in any case be updated to indicate that it affects the decision to optimize cold loops for size.

7728

If a loop is known to iterate fewer than TinyTripCountVectorThreshold times, we avoid vectorizing it altogether rather than vectorizing it under code-size constraints, unless vectorization is explicitly forced.

So should this

const unsigned MaxTC = SE->getSmallConstantMaxTripCount(L);
if (MaxTC > 0u && MaxTC < TinyTripCountVectorThreshold) {
  ...
}

be extended to use profiling where static analysis fails, e.g., by inserting the following between the top two lines above:

if (MaxTC == 0 && LoopVectorizeWithBlockFrequency) {
  auto EstimatedTC = getLoopEstimatedTripCount(L);
  if (EstimatedTC)
    MaxTC = *EstimatedTC;
}

?

OTOH, setting OptForSize to true when the trip count is unknown effectively prevents vectorization, because an epilog is needed.

test/Transforms/LoopVectorize/tripcount.ll
3

(As argued above, we expect loop not to be vectorized, rather than optimized for size.)

16

Better check instead that no vector types are generated, regardless of their size.

twoh updated this revision to Diff 98126. Sun, May 7, 9:49 PM

Addressing comments from Ayal.

twoh added a comment (edited). Sun, May 7, 9:59 PM

Thanks @Ayal for your comments! If the profile-based trip count check is moved above the line

if (MaxTC > 0u && MaxTC < TinyTripCountVectorThreshold)

it wouldn't be possible to distinguish the case where static analysis fails to compute MaxTC from the case where the profile-based trip count is actually 0. Also, as profile-based numbers are not as definitive as numbers from static analysis, I think it might be worth just optimizing for size rather than disabling vectorization. As you mentioned in your comment, OptForSize is effectively the same as disabling vectorization for now, but the algorithm for the OptForSize case might change in the future.

I updated the test per your suggestion. And the target was x86-64 (sandybridge).

Ayal added a comment. Thu, May 11, 1:25 PM

I'm inclined to treat TinyTripCount loops with associated reports the same for static and profile-based TripCounts, but am ok with setting OptForSize if profile-based-TripCount < TinyTripCount while aborting & reporting when static-TripCount < TinyTripCount, as suggested. The outcome is practically the same (see below). @mssimpso, @mkuper - do you have a preference here?

More comments:

In D32451#748358, @twoh wrote:

Thanks @Ayal for your comments! If the profile-based trip count check is moved above the line

if (MaxTC > 0u && MaxTC < TinyTripCountVectorThreshold)

it wouldn't be possible to distinguish the case where static analysis fails to compute MaxTC from the case where the profile-based trip count is actually 0. Also, as profile-based numbers are not as definitive as numbers from static analysis, I think it might be worth just optimizing for size rather than disabling vectorization. As you mentioned in your comment, OptForSize is effectively the same as disabling vectorization for now, but the algorithm for the OptForSize case might change in the future.

Having EstimatedTC < TinyTripCountVectorThreshold should not imply "IsColdLoop". The loop may be hot.

Yes, getSmallConstantMaxTripCount() should also return an Optional<unsigned> (but not in this commit).

Can alternatively do

unsigned ExpectedTC = SE->getSmallConstantMaxTripCount(L);
bool HasExpectedTC = (ExpectedTC > 0);

if (!HasExpectedTC && LoopVectorizeWithBlockFrequency) {
  auto EstimatedTC = getLoopEstimatedTripCount(L);
  if (EstimatedTC) {
    ExpectedTC = *EstimatedTC;
    HasExpectedTC = true;
  }
}

if (HasExpectedTC && ExpectedTC < TinyTripCountVectorThreshold) {
  ...
}

OTOH, setting OptForSize to true when the trip count is unknown effectively prevents vectorization, because an epilog is needed.

Just for completeness, if the trip count is unknown but known to be divisible by VF, the loop could potentially be vectorized without an epilog.
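To make the divisibility point concrete: the scalar epilog has to cover the iterations left over after the vector loop, which is simply the trip count modulo the vectorization factor. A trivial sketch (hypothetical helper, ignoring runtime checks that may independently force a scalar tail):

```cpp
// Iterations left for a scalar remainder loop after the vector loop has
// executed floor(TripCount / VF) full vector iterations. A result of 0
// means the vector loop covers everything and no epilog is needed.
unsigned epilogIterations(unsigned TripCount, unsigned VF) {
  return TripCount % VF;
}
```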

Ayal added inline comments. Thu, May 11, 2:00 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
7729

When is the LoopEntryFreq < ColdEntryFreq criteria expected to kick in - when the loop latch has no associated frequencies (!EstimatedTC) but the function entry and loop preheader do? Looks like this criteria is effectively disabled, right?

twoh added a comment. Thu, May 11, 10:05 PM

I don't have a strong opinion about whether to treat the static trip count and the profile-based trip count the same. I treated them separately because the existing implementation sets OptForSize to true instead of returning false when it makes a profile-based decision. It would be great if someone could explain the reason behind that.

@Ayal Yes, the 'LoopEntryFreq < ColdEntryFreq' criteria kicks in if the loop latch has no associated frequencies but the function entry and loop preheader do. I actually observed such a case in some of our internal code, because loop transformations mess up the branch profile info. However, although I kept the criteria, I'm not sure about its effectiveness, as I wrote in the summary. Maybe we keep the criteria but change it to 'LoopEntryFreq == 0' to disable vectorization only for obviously cold loops.

I'll wait for other opinions, and if there's no input, will update the code per Ayal's suggestions. Thanks!

Ayal added a comment. Sat, May 13, 2:31 PM
In D32451#752865, @twoh wrote:

I don't have a strong opinion about whether to treat the static trip count and the profile-based trip count the same. I treated them separately because the existing implementation sets OptForSize to true instead of returning false when it makes a profile-based decision. It would be great if someone could explain the reason behind that.

A similar OptForSize constraint appears in MachineBlockPlacement:

// If the block is cold relative to the function entry don't waste space
// aligning it.

You may want to take a look at revision 200294 and discussion thread http://lists.llvm.org/pipermail/llvm-dev/2014-February/069932.html

@Ayal Yes, the 'LoopEntryFreq < ColdEntryFreq' criteria kicks in if the loop latch has no associated frequencies but the function entry and loop preheader do. I actually observed such a case in some of our internal code, because loop transformations mess up the branch profile info. However, although I kept the criteria, I'm not sure about its effectiveness, as I wrote in the summary. Maybe we keep the criteria but change it to 'LoopEntryFreq == 0' to disable vectorization only for obviously cold loops.

It would probably be best to fix the messed-up branch-profile info. This also relates to your observations analyzing Video-SIMDBench. Could you provide reproducer(s)?

twoh added a comment. Sat, May 13, 10:38 PM

@Ayal Thanks for the pointer.

It would probably be best to fix the messed-up branch-profile info. This also relates to your observations analyzing Video-SIMDBench. Could you provide reproducer(s)?

Sorry, but I've lost track of what the case was. But I'm planning to take a look at branch profile info propagation with Video-SIMDBench and others to improve the precision.

twoh updated this revision to Diff 99616. Fri, May 19, 1:20 PM

Addressing Ayal's comments.

Ayal added a comment. Sat, May 20, 6:40 AM

Thanks @twoh - this looks fine to me, and is pending @mkuper's approval following his original concern:

When I tried to play with this kind of thing, I got rather inconsistent performance results, because of things like a cold loop within a relatively hot loop.
Do you have any performance numbers for this?

Having lifted the OptForSize constraint when vectorizing cold loops, do you see any increase in compile-time / code-size when compiling with profile information?

A follow-up thought raised by this patch, which deserves a separate discussion/patch: should loops whose trip-count is smaller than TinyTripCountVectorThreshold, statically or by profile, be vectorized under OptForSize constraint rather than non-vectorized? The OptForSize constraint implies little if any overheads outside of the vectorized loop body, so the current cost estimate of the vectorized-vs-scalar loop body should hopefully be more/sufficiently accurate.

twoh added a comment. Mon, May 22, 3:33 PM

Comparing the existing implementation and this patch, I don't observe a noticeable compile time difference with Video-SIMDBench. I compiled the benchmark suite 3 times each, and the median was 27.12 sec vs 27.55 sec, while the average was 27.96 sec vs 27.89 sec. There were some code size differences, but that is simply because more vectorization results in bigger code size. There's no difference between OptForSize and non-vectorized.

I observe that branch frequency metadata handling has improved since I first submitted this diff, which makes a difference for some benchmarks. For example, mc_chroma, which was not vectorized with the existing implementation but was vectorized with this patch, is now vectorized with the original implementation but not with this patch. However, the branch frequency metadata is still not perfect, and I was able to find 4 loops in mc_luma whose loop entry frequency information is available but whose estimated trip count is not. These loops are vectorized if we make the vectorization decision based only on the trip count, but not if we fall back to the loop entry frequency when the trip count is unavailable, because their entry frequency is smaller than the threshold. Also, there are loops whose trip counts are underestimated and which therefore miss the vectorization opportunity. By chance, these loops perform better with the existing implementation because their loop entry frequency is higher than the cold entry frequency.

In summary, I don't see much difference between OptForSize and non-vectorized, but I do see the potential for better vectorization decisions from this patch given more precise profile info.