This is an archive of the discontinued LLVM Phabricator instance.

[LV] Vectorizer should adjust trip count in profile information
ClosedPublic

Authored by ebrevnov on Sep 23 2019, 4:33 AM.

Details

Summary

Vectorized loop processes VFxUF number of elements in one iteration thus total number of iterations decreases proportionally. In addition epilog loop may not have more than VFxUF - 1 iterations. This patch updates profile information accordingly.

Diff Detail

Event Timeline

ebrevnov created this revision.Sep 23 2019, 4:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 23 2019, 4:33 AM
ebrevnov updated this revision to Diff 222121.Sep 27 2019, 3:28 AM

Minor test update

DaniilSuchkov accepted this revision.Nov 12 2019, 10:20 PM

LGTM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3977

Nit: VFxUF - 1

3983–3985

Style: it is usually advised to turn such conditions into early exits, it would reduce required indentation and slightly improve readability.

4004

Maybe introduce a new variable for this value (like EpilogueTakenCount)? Right now it's a bit surprising that OrigSomething is being changed. Same goes to OrigFallThroughCount.

This revision is now accepted and ready to land.Nov 12 2019, 10:20 PM
ebrevnov updated this revision to Diff 229066.Nov 13 2019, 5:15 AM

Minor fixes as requested by reviewer.

ebrevnov updated this revision to Diff 229260.Nov 14 2019, 3:41 AM

I realized that current implementation has a flaw and and we should take into account that actual number of iterations is one greater than back edge taken count. In addition I believe that current structuring of calculations is easier for understanding.

Ayal added a comment.Nov 20 2019, 3:46 AM

Adding a few comments. Would be good to generalize and apply also to loop unroll (and jam).

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3986

OrigFallThroughCount can still be either the exit count or the continue-to-next-iteration count, according to the code below. Wait to test if its zero until we know what it stands for?

3998

Better use distinct names, e.g., OrigExitCount and OrigBackedgeTakenCount, than continue to call them Taken and FallThrough. Perhaps use Weight instead of Count, to denote total profile frequencies, as the latter is used elsewhere to denote the actual per-invocation TripCount.

4001

bel[l]ow

4003

How about "OrigAverageTripCount"?

Explanation about its computation:
OrigAverageTripCount = (number of times header block was executed) / (number of times header was reached from pre-header == number of times latch exited)
== (OrigTakenCount + OrigFallThroughCount) / OrigFallThroughCount
== OrigTakenCount / OrigFallThroughCount + 1.

4005

How about VecAverageTripCount = OrigAverageTripCount / (VF * UF);

4009

Just to clarify, maintaining branch frequencies through optimizations is best-effort and imprecise - a total weight that does not divide VF*UF implies that the trip count of at-least one invocation did not divide VF*UF, not necessarily all of them; w/o considering also the distribution of trip counts in addition to their sum.

Setting PRIterCount = 0 and VecAverageTipCount = round(OrigAverageTripCount / (VF*UV)) when Cost->foldTailByMasking() is probably the best that can be done. The former is redundant given that it applies to dead code, and the latter should perhaps apply to all cases, in general.

4013

There's also the special case of requiresScalarEpiloque() where 0 < PEIterCount <= VF*UF for each invocation of the loop, and hence the average is also strictly positive FWIW. But best keep the approximation general instead of trying to improve it, given general lack of information.

4020

This assumes the number of times the vector loop will be reached is equal to the number of times the original scalar loop was reached (OrigFallThrougCount). This holds is Cost->foldTailByMasking(), but otherwise invocations whose trip count < VF*UF will bypass the vector loop (and also == VF*UF if requireScalarEpilogue()), plus other run time guards.

4029

Similar to above comment, invocations whose trip count divides VF*UF will bypass the scalar remainder loop (w/o foldTailByMasking nor requireScalarEpilogue), so in general PEFallThroughCount <= OrigFallThroughCount.

llvm/test/Transforms/LoopVectorize/check-prof-info.ll
4

May want to also check with UF>1.

ebrevnov updated this revision to Diff 230432.Nov 21 2019, 4:57 AM
ebrevnov marked 7 inline comments as done.

Addressed Ayal's comments

ebrevnov marked an inline comment as done.Nov 21 2019, 4:59 AM
ebrevnov added inline comments.Nov 21 2019, 4:59 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3986

Good catch. Thanks!

3998

I fixed names. But I don't see reasons to use different variables here (if this is what you meant)

4001

fixed

4003

Ok. Turned your explanation to a comment.

4013

Agree. Let me remove this special case then.

4020

Please note that VecFallThrough is zero initially and set to OrigFallThrougCount only if vector loop is expected to be executed (VecIterCount > 0)

4029

Same explanation as for the above.

llvm/test/Transforms/LoopVectorize/check-prof-info.ll
4

I replaced masked case since we don't do anything special for it now.

Ayal added a comment.Dec 5 2019, 5:00 AM

Still think it would be better to provide this as a standalone function in Transforms/Utils/LoopUtils, for potential benefit of loop unroll (and jam) passes in addition to LV. Having agreed to ignore foldTail and requiresScalarEpilog, there's nothing vectorization-specific to do here. There's still an issue though with the fact that LV may use the scalar loop for both the remaining TC%(VF*UF) iterations when running the vector loop, and for all TC iterations when runtime guards bypass the vector loop. In absence of information, each such guard could be assigned 0.5 probability, or one could be aggressively optimistic and hope vector loop is always reached. In any case this deserves a comment.

Suggesting further variable name changes for the three Orig, Unrolled, and Remainder Loops, each having a LoopEntry==LoopExit edge weight, a Backedge weight, a HeaderBlock weight, and an AverageTripCount. The actual weights are recorded as TrueVal and FalseVal of the latch branches.

Patch needs to be clang-format'ed.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3975

"is less ... than original TC" >> "is smaller than the original TC by a factor of VFxUF"

3982

OrigTakenWeight >> OrigBackedgeTakenWeight or OrigBackedgeWeight ?
OrigExitWeight >> OrigLoopExitWeight?
May help to also set OrigLoopEntryWeight = OrigLoopExitWeight?

3985

OrigBackBranchI >> OrigLoopLatchBranch ?

3988

It seems clearer to call extractProfMetadata(TrueVal, FalseVal) and then set BackedgeTaken and LoopExit/Entry weights according to if (OrigLoopLatchBranch->getSuccessor(0) == OrigLoop->getHeader()), following LoopUtil's getLoopEstimatedTripCount(). Analogously for createBranchWeights(TrueVal, FalseVal).

In any case, better rename "IsTrueBackEdge*Loop".

4002

Patch needs to be clang-format'ed

4009

Instead of providing the explanation in a comment, seems better to implement the code this way, leaving the +1 for the compiler to optimize. I.e.,

const uint64_t OrigHeaderBlockWeight = OrigBackedgeTakenWeight + OrigLoopEntryWeight;
const unit64_t OrigAverageTripCount = OrigHeaderBlockWeight / OrigLoopEntryWeight;
4016

Better rename/expand "PE".
PEIterCount >> RemainderLoopAverageTripCount?

4020

In the general context, "Vec" >> "UnrolledLoop".
VecTakenCount >> UnrolledLoopBackedgeWeight
VecFallThrough >> UnrolledLoopExitWeight and/or UnrolledLoopEntryWeight

4022

How about

if (UnrolledLoopAverageTripCount > 0) {
  UnrolledLoopEntryWeight = OrigLoopEntryWeight;
  uint64_t UnrolledLoopHeaderWeight = UnrolledLoopAverageTripCount * UnrolledLoopEntryWeight; // Analogous to computing OrigLoopAverageTripCount from Header and Entry weights above.
  UnrolledLoopBackedgeWeight = UnrolledLoopHeaderWeight - UnrolledLoopEntryWeight;
}

leaving the -1 optimization to the compiler.

4028

PETakenCount >> RemainderLoopBackedgeWeight
PEFallThroughCount >> RemainderLoopExitWeight and/or RemainderLoopEntryWeight

ebrevnov marked 9 inline comments as done.Dec 9 2019, 1:55 AM

Still think it would be better to provide this as a standalone function in Transforms/Utils/LoopUtils, for potential benefit of loop unroll (and jam) passes in addition to LV. Having agreed to ignore foldTail and requiresScalarEpilog, there's nothing vectorization-specific to do here. There's still an issue though with the fact that LV may use the scalar loop for both the remaining TC%(VF*UF) iterations when running the vector loop, and for all TC iterations when runtime guards bypass the vector loop. In absence of information, each such guard could be assigned 0.5 probability, or one could be aggressively optimistic and hope vector loop is always reached. In any case this deserves a comment.

Suggesting further variable name changes for the three Orig, Unrolled, and Remainder Loops, each having a LoopEntry==LoopExit edge weight, a Backedge weight, a HeaderBlock weight, and an AverageTripCount. The actual weights are recorded as TrueVal and FalseVal of the latch branches.

Patch needs to be clang-format'ed.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3988

Don't feel convinced. My point would be that extra variables and conditional reassignments make the code less readable. I think this is very subjective thing.

4022

That will make computations less stable to overflow. Personally I feel the way it's written today has the same level of complexity for understanding.

ebrevnov updated this revision to Diff 232778.Dec 9 2019, 1:56 AM

Addressing issues raised by Ayal.

ebrevnov updated this revision to Diff 232779.Dec 9 2019, 1:59 AM

Typo fixed.

Harbormaster completed remote builds in B42094: Diff 232779.
Ayal added inline comments.Dec 18 2019, 1:51 PM
llvm/include/llvm/Transforms/Utils/LoopUtils.h
361 ↗(On Diff #232779)

The fact that OrigLoop is both the original loop containing the original profile weights, and acts as the RemainderLoop dedicated to leftover iterations, should be clarified. Alternatively, this utility can receive three loops: OrigLoop, UnrolledLoop and RemainderLoop, leaving it to the caller to decide if to pass OrigLoop also as RemainderLoop.

Would probably be clearer to start with UnrolledLoop receiving weights that reflect TC/UF iterations, and then OrigLoop which receives weights that reflect the remaining TC%UF iterations.

363 ↗(On Diff #232779)

U[n]rolledLoop, several occurrences.

"\UF" >> "\p UF"

llvm/lib/Transforms/Utils/LoopUtils.cpp
1040 ↗(On Diff #232779)

Worth commenting that OrigLoopEntryWeight also holds OrigLoopExitWeight, which is more clearly the weight associated with the (exit direction of the) latch branch.

1048 ↗(On Diff #232779)

UnrolledBBI >> Unrolled[Loop]LatchBranch, as in OrigLoopLatchBranch.

As the names end up overflowing lines, can use Orig, Unrolled and Remainder to stand for OrigLoop, UnrolledLoop and RemainderLoop; i.e., taking "Loop" out.

1051 ↗(On Diff #232779)

VecLoop >> UnrolledLoop

1061 ↗(On Diff #232779)

Can drop the 'const', for consistency; these temporaries are obviously const's.

1064 ↗(On Diff #232779)

Note that this is rounding down. Can add half of the denominator to the nominator before dividing in order to round more accurately; this is what getLoopEstimatedTripCount() does, but it seems to be off by 1 as it computes BackEdgeTakenWeight / LoopEntryWeight rounded to nearest, instead of HeaderBlockWeight / LoopEntryWeight rounded to nearest...

Simply call OrigAverageTripCount = getLoopEstimatedTripCount(OrigLoop)?

Perhaps having a setLoopEstimatedTripCount(Loop, EstimatedTripCount, EstimatedEntryWeight) would help fold the identical treatment of UnrolledLoop and RemainderLoop into one function, which also takes care of figuring out the True/False vs. Backedge/Exit directions?

1066 ↗(On Diff #232779)

U[n]rolledAverageTripCount

1076 ↗(On Diff #232779)

Seems slightly more logical to first set UnrolledLoopEntryWeight, and then using it set UnrolledLoopBackedgeWeight.

1085 ↗(On Diff #232779)

ditto

1098 ↗(On Diff #232779)

(This actually replaces the old profile metadata with the new one.)

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
118

Is this include still needed here?

3464

Comment below should start with a short sentence explaining that profile weights associated with the original loop are now distributed among the vector and scalar loops.

ebrevnov marked 16 inline comments as done.Dec 26 2019, 3:29 AM
ebrevnov added inline comments.
llvm/lib/Transforms/Utils/LoopUtils.cpp
1040 ↗(On Diff #232779)

IMHO instead of trying to clarify with a comment we better find self descriptive name for such a simple and commonly used thing. Strictly speaking OrigLoopEntryWeight != OrigLoopExitWeight. Do you find OrigBackEdgeExitWeight good enough?

1048 ↗(On Diff #232779)

Removed "Loop" from most names to make them a little shorter.

1064 ↗(On Diff #232779)

This "off by 1" stopped me from using it in the first place since that could be important in some cases. OK, let's reuse getLoopEstimatedTripCount. To be able to do that there are some changes to getLoopEstimatedTripCount.

ebrevnov updated this revision to Diff 235335.Dec 26 2019, 3:30 AM
ebrevnov marked 3 inline comments as done.

Updated as requested.

Hi @Ayal. Thanks for you input. I fixed all places as you suggested. Please check.

Ayal added a comment.Dec 26 2019, 8:44 AM

Thanks for making all the changes! More comments inline.

llvm/lib/Transforms/Utils/LoopUtils.cpp
680 ↗(On Diff #235335)

Comment what this new function is for. Rename (see below)? Retain "Support loops ..." comment, added in D64553?

712 ↗(On Diff #235335)

dyn_cast >> cast

Perhaps update above function to do here something like
BranchInst *LatchBR = getExpectedExitLoopLatchBranch(L)
checking if it returned nullptr or not?

733 ↗(On Diff #235335)

Thanks for taking care of this fix to improve accuracy of estimated TC!
But doing so deserves a separate patch, and tests. Note that it also effects loop unrolling, i.e., its effects are beyond LV. This part can be introduced either before or after the part that teaches LV to maintain profiling info.

754 ↗(On Diff #235335)

ditto (dyn_cast >> cast, ...)

755 ↗(On Diff #235335)

Better check similar to above if (LatchBR->getSuccessor(0) != L->getHeader())

1085 ↗(On Diff #235335)

"the \p UnrolledLoop \p RemainderLoop" >>
"\p UnrolledLoop and \p RemainderLoop"

1089 ↗(On Diff #235335)

May also be worthwhile asserting that UF is positive (or greater than 1?)

1090 ↗(On Diff #235335)

are expected to be distinct

1040 ↗(On Diff #232779)

IMHO instead of trying to clarify with a comment we better find self descriptive name for such a simple and commonly used thing.

Definitely agree with (the preference of) finding self descriptive variable names.

Strictly speaking OrigLoopEntryWeight != OrigLoopExitWeight"

How so, given that OrigLoop has a single-entry, and an "expected" single-exit: there may be other "side/deopt" exits, but these are expected to have zero weight. E.g., when computing LoopHeaderWeight above, adding LoopEntryWeight (instead of LoopExitWeight) to BackEdgeTakenWeight seems more logical.

Do you find OrigBackEdgeExitWeight good enough?

Strictly speaking, a BackEdge cannot exit: it is an edge going from a latch block to a header block, and "BackEdgeTaken" is the number of times this edge is taken=traversed, which equals the number of times its latch branch is "taken" rather than "falls-thru" (if it's true direction points to the header). Perhaps LoopEntryExitWeight or LoopInvocationWeight could be used instead/in addition.

1064 ↗(On Diff #232779)

This "off by 1" stopped me from using it in the first place since that could be important in some cases. OK, let's reuse getLoopEstimatedTripCount. To be able to do that there are some changes to getLoopEstimatedTripCount.

Indeed, best fix and reuse, in a dedicated patch as raised above, thereby isolating impact on such "some cases".

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3473

"is not taken into account as unlikely case" >>
"is ignored, assigning all the weight to the vector loop, optimistically"

llvm/test/Transforms/LoopVectorize/check-prof-info.ll
7

Tests targeting x86 need to reside in LoopVectorize/X86

ebrevnov updated this revision to Diff 235579.Dec 30 2019, 4:31 AM
ebrevnov marked 15 inline comments as done.

One more round of updates.

ebrevnov added inline comments.Dec 30 2019, 11:47 PM
llvm/lib/Transforms/Utils/LoopUtils.cpp
680 ↗(On Diff #235335)

The comment is related to get/setLoopEstimatedTripCount and still there...

712 ↗(On Diff #235335)

I was thinking about that in the first place but didn't come up with a "good" enough name. I can go with that name if you like :-)

733 ↗(On Diff #235335)

Ok.

1089 ↗(On Diff #235335)

I think we better support 1 which could be used in some corner cases....

1040 ↗(On Diff #232779)

How so, given that OrigLoop has a single-entry, and an "expected" single-exit: there may be other "side/deopt" exits, but these are expected to have zero weight. E.g., when computing LoopHeaderWeight above, adding LoopEntryWeight (instead of LoopExitWeight) to BackEdgeTakenWeight seems more logical.

For infinite loop entry is 1 while exit is 0 :-). I understand this is extreme we will never meet but still....

Strictly speaking, a BackEdge cannot exit: it is an edge going from a latch block to a header block, and "BackEdgeTaken" is the number of times this edge is taken=traversed, which equals the number of times its latch branch is "taken" rather than "falls-thru" (if it's true direction points to the header).

That's why I used TakenCount and FallThroughCount in the very first version what perfectly matches your description. I don't feel we are getting any better names with more iterations.... Perhaps LatchCycleWeight - number of times we go to loop header from the latch and LatchExitWeight - number of times we go to loop exit from the latch?

Perhaps LoopEntryExitWeight or LoopInvocationWeight could be used instead/in addition.

I think LoopEntryExitWeight may be confusing....
I think it makes sense to use EstimatedLoopInvocationWeight in conjunction with EstimatedTripCount as parameters to get/setEstimatedTripCount interface while LatchCycleWeight and LatchExitWeight in the implementation as they are little-bit more low level.

Ayal accepted this revision.Jan 15 2020, 3:59 PM

This looks good to me, thanks!

llvm/lib/Transforms/Utils/LoopUtils.cpp
775 ↗(On Diff #236991)

The last part could call fixupBranchWeights() if moved here from Transforms/Utils/LoopUnrollPeel.cpp

llvm/test/Transforms/LoopVectorize/tripcount.ll
211

Following this, to clarify:
original loop has latchExitWeight=10 and backedgeTakenWeight=10,000, therefore estimatedBackedgeTakenCount=1,000 and estimatedTripCount=1,001.
Vectorizing by 4 produces
estimatedTripCounts of 1,001/4=250 and 1,001%4=1 for vectorized and remainder loops, respectively, therefore their
estimatedBackedgeTakenCounts are 249 and 0, and so the weights recorded with loop invocation weights of 10 are the above {10, 2490} and {10, 0}.

ebrevnov marked 2 inline comments as done.Jan 15 2020, 7:17 PM
ebrevnov added inline comments.
llvm/lib/Transforms/Utils/LoopUtils.cpp
775 ↗(On Diff #236991)

That would require to change implementation for fixupBranchWeights since it disregards to update when back edge taken count is zero.

llvm/test/Transforms/LoopVectorize/tripcount.ll
211

I will add this text to the test. I that what you wanted (just not sure :-))?

Ayal added inline comments.Jan 16 2020, 11:55 PM
llvm/lib/Transforms/Utils/LoopUtils.cpp
775 ↗(On Diff #236991)

Agreed. Regarding disregarding to update when backedge taken weight is zero, note that in fixupBranchWeights(), BackedgeTakenWeight is called FallThroughWeight(?) "The weight of the edge from Latch to Header", and that

// FallThroughWeight is 0 means that there is no branch weights on original
// latch block or estimated trip count is zero.

Regarding the first meaning of 0, whoever calls fixupBranchWeights() should do so only if there were such weights on the original latch block, similar to the caller of setLoopEstimatedTripCount(). Regarding the second meaning of 0, it seem fixupBranchWeights() suffers from same +1 issue: estimating trip count to be zero when backedge taken weight is zero.

It would be good to fix and centralize the support for updating weights of loops, but such refactoring can be done as a separate follow-up patch, after landing this (accepted) patch.

This revision was automatically updated to reflect the committed changes.