Page MenuHomePhabricator

[LV] Vectorizer should adjust trip count in profile information
AcceptedPublic

Authored by ebrevnov on Sep 23 2019, 4:33 AM.

Details

Summary

Vectorized loop processes VFxUF number of elements in one iteration thus total number of iterations decreases proportionally. In addition epilog loop may not have more than VFxUF - 1 iterations. This patch updates profile information accordingly.

Event Timeline

ebrevnov created this revision.Sep 23 2019, 4:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 23 2019, 4:33 AM
ebrevnov updated this revision to Diff 222121.Sep 27 2019, 3:28 AM

Minor test update

DaniilSuchkov accepted this revision.Nov 12 2019, 10:20 PM

LGTM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3975

Nit: VFxUF - 1

3981–3983

Style: it is usually advised to turn such conditions into early exits, it would reduce required indentation and slightly improve readability.

4002

Maybe introduce a new variable for this value (like EpilogueTakenCount)? Right now it's a bit surprising that OrigSomething is being changed. Same goes to OrigFallThroughCount.

This revision is now accepted and ready to land.Nov 12 2019, 10:20 PM
ebrevnov updated this revision to Diff 229066.Nov 13 2019, 5:15 AM

Minor fixes as requested by reviewer.

ebrevnov updated this revision to Diff 229260.Thu, Nov 14, 3:41 AM

I realized that current implementation has a flaw and and we should take into account that actual number of iterations is one greater than back edge taken count. In addition I believe that current structuring of calculations is easier for understanding.

Ayal added a comment.Wed, Nov 20, 3:46 AM

Adding a few comments. Would be good to generalize and apply also to loop unroll (and jam).

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3984

OrigFallThroughCount can still be either the exit count or the continue-to-next-iteration count, according to the code below. Wait to test if its zero until we know what it stands for?

3996

Better use distinct names, e.g., OrigExitCount and OrigBackedgeTakenCount, than continue to call them Taken and FallThrough. Perhaps use Weight instead of Count, to denote total profile frequencies, as the latter is used elsewhere to denote the actual per-invocation TripCount.

3999

bel[l]ow

4001

How about "OrigAverageTripCount"?

Explanation about its computation:
OrigAverageTripCount = (number of times header block was executed) / (number of times header was reached from pre-header == number of times latch exited)
== (OrigTakenCount + OrigFallThroughCount) / OrigFallThroughCount
== OrigTakenCount / OrigFallThroughCount + 1.

4003

How about VecAverageTripCount = OrigAverageTripCount / (VF * UF);

4007

Just to clarify, maintaining branch frequencies through optimizations is best-effort and imprecise - a total weight that does not divide VF*UF implies that the trip count of at-least one invocation did not divide VF*UF, not necessarily all of them; w/o considering also the distribution of trip counts in addition to their sum.

Setting PRIterCount = 0 and VecAverageTipCount = round(OrigAverageTripCount / (VF*UV)) when Cost->foldTailByMasking() is probably the best that can be done. The former is redundant given that it applies to dead code, and the latter should perhaps apply to all cases, in general.

4011

There's also the special case of requiresScalarEpiloque() where 0 < PEIterCount <= VF*UF for each invocation of the loop, and hence the average is also strictly positive FWIW. But best keep the approximation general instead of trying to improve it, given general lack of information.

4018

This assumes the number of times the vector loop will be reached is equal to the number of times the original scalar loop was reached (OrigFallThrougCount). This holds is Cost->foldTailByMasking(), but otherwise invocations whose trip count < VF*UF will bypass the vector loop (and also == VF*UF if requireScalarEpilogue()), plus other run time guards.

4027

Similar to above comment, invocations whose trip count divides VF*UF will bypass the scalar remainder loop (w/o foldTailByMasking nor requireScalarEpilogue), so in general PEFallThroughCount <= OrigFallThroughCount.

llvm/test/Transforms/LoopVectorize/check-prof-info.ll
4

May want to also check with UF>1.

ebrevnov updated this revision to Diff 230432.Thu, Nov 21, 4:57 AM
ebrevnov marked 7 inline comments as done.

Addressed Ayal's comments

ebrevnov marked an inline comment as done.Thu, Nov 21, 4:59 AM
ebrevnov added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3984

Good catch. Thanks!

3996

I fixed names. But I don't see reasons to use different variables here (if this is what you meant)

3999

fixed

4001

Ok. Turned your explanation to a comment.

4011

Agree. Let me remove this special case then.

4018

Please note that VecFallThrough is zero initially and set to OrigFallThrougCount only if vector loop is expected to be executed (VecIterCount > 0)

4027

Same explanation as for the above.

llvm/test/Transforms/LoopVectorize/check-prof-info.ll
4

I replaced masked case since we don't do anything special for it now.

Ayal added a comment.Thu, Dec 5, 5:00 AM

Still think it would be better to provide this as a standalone function in Transforms/Utils/LoopUtils, for potential benefit of loop unroll (and jam) passes in addition to LV. Having agreed to ignore foldTail and requiresScalarEpilog, there's nothing vectorization-specific to do here. There's still an issue though with the fact that LV may use the scalar loop for both the remaining TC%(VF*UF) iterations when running the vector loop, and for all TC iterations when runtime guards bypass the vector loop. In absence of information, each such guard could be assigned 0.5 probability, or one could be aggressively optimistic and hope vector loop is always reached. In any case this deserves a comment.

Suggesting further variable name changes for the three Orig, Unrolled, and Remainder Loops, each having a LoopEntry==LoopExit edge weight, a Backedge weight, a HeaderBlock weight, and an AverageTripCount. The actual weights are recorded as TrueVal and FalseVal of the latch branches.

Patch needs to be clang-format'ed.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3973

"is less ... than original TC" >> "is smaller than the original TC by a factor of VFxUF"

3980

OrigTakenWeight >> OrigBackedgeTakenWeight or OrigBackedgeWeight ?
OrigExitWeight >> OrigLoopExitWeight?
May help to also set OrigLoopEntryWeight = OrigLoopExitWeight?

3983

OrigBackBranchI >> OrigLoopLatchBranch ?

3986

It seems clearer to call extractProfMetadata(TrueVal, FalseVal) and then set BackedgeTaken and LoopExit/Entry weights according to if (OrigLoopLatchBranch->getSuccessor(0) == OrigLoop->getHeader()), following LoopUtil's getLoopEstimatedTripCount(). Analogously for createBranchWeights(TrueVal, FalseVal).

In any case, better rename "IsTrueBackEdge*Loop".

4000

Patch needs to be clang-format'ed

4007

Instead of providing the explanation in a comment, seems better to implement the code this way, leaving the +1 for the compiler to optimize. I.e.,

const uint64_t OrigHeaderBlockWeight = OrigBackedgeTakenWeight + OrigLoopEntryWeight;
const unit64_t OrigAverageTripCount = OrigHeaderBlockWeight / OrigLoopEntryWeight;
4014

Better rename/expand "PE".
PEIterCount >> RemainderLoopAverageTripCount?

4018

In the general context, "Vec" >> "UnrolledLoop".
VecTakenCount >> UnrolledLoopBackedgeWeight
VecFallThrough >> UnrolledLoopExitWeight and/or UnrolledLoopEntryWeight

4020

How about

if (UnrolledLoopAverageTripCount > 0) {
  UnrolledLoopEntryWeight = OrigLoopEntryWeight;
  uint64_t UnrolledLoopHeaderWeight = UnrolledLoopAverageTripCount * UnrolledLoopEntryWeight; // Analogous to computing OrigLoopAverageTripCount from Header and Entry weights above.
  UnrolledLoopBackedgeWeight = UnrolledLoopHeaderWeight - UnrolledLoopEntryWeight;
}

leaving the -1 optimization to the compiler.

4026

PETakenCount >> RemainderLoopBackedgeWeight
PEFallThroughCount >> RemainderLoopExitWeight and/or RemainderLoopEntryWeight

ebrevnov marked 9 inline comments as done.Mon, Dec 9, 1:55 AM

Still think it would be better to provide this as a standalone function in Transforms/Utils/LoopUtils, for potential benefit of loop unroll (and jam) passes in addition to LV. Having agreed to ignore foldTail and requiresScalarEpilog, there's nothing vectorization-specific to do here. There's still an issue though with the fact that LV may use the scalar loop for both the remaining TC%(VF*UF) iterations when running the vector loop, and for all TC iterations when runtime guards bypass the vector loop. In absence of information, each such guard could be assigned 0.5 probability, or one could be aggressively optimistic and hope vector loop is always reached. In any case this deserves a comment.

Suggesting further variable name changes for the three Orig, Unrolled, and Remainder Loops, each having a LoopEntry==LoopExit edge weight, a Backedge weight, a HeaderBlock weight, and an AverageTripCount. The actual weights are recorded as TrueVal and FalseVal of the latch branches.

Patch needs to be clang-format'ed.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3986

Don't feel convinced. My point would be that extra variables and conditional reassignments make the code less readable. I think this is very subjective thing.

4020

That will make computations less stable to overflow. Personally I feel the way it's written today has the same level of complexity for understanding.

ebrevnov updated this revision to Diff 232778.Mon, Dec 9, 1:56 AM

Addressing issues raised by Ayal.

ebrevnov updated this revision to Diff 232779.Mon, Dec 9, 1:59 AM

Typo fixed.

Harbormaster completed remote builds in B42094: Diff 232779.