Page MenuHomePhabricator

[LV] Vectorizer should adjust trip count in profile information
ClosedPublic

Authored by ebrevnov on Sep 23 2019, 4:33 AM.

Details

Summary

Vectorized loop processes VFxUF number of elements in one iteration thus total number of iterations decreases proportionally. In addition epilog loop may not have more than VFxUF - 1 iterations. This patch updates profile information accordingly.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
ebrevnov added inline comments.Nov 21 2019, 4:59 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4064

Good catch. Thanks!

4076

I fixed names. But I don't see reasons to use different variables here (if this is what you meant)

4079

fixed

4081

Ok. Turned your explanation to a comment.

4091

Agree. Let me remove this special case then.

4098

Please note that VecFallThrough is zero initially and set to OrigFallThrougCount only if vector loop is expected to be executed (VecIterCount > 0)

4107

Same explanation as for the above.

llvm/test/Transforms/LoopVectorize/check-prof-info.ll
4

I replaced masked case since we don't do anything special for it now.

Ayal added a comment.Dec 5 2019, 5:00 AM

Still think it would be better to provide this as a standalone function in Transforms/Utils/LoopUtils, for potential benefit of loop unroll (and jam) passes in addition to LV. Having agreed to ignore foldTail and requiresScalarEpilog, there's nothing vectorization-specific to do here. There's still an issue though with the fact that LV may use the scalar loop for both the remaining TC%(VF*UF) iterations when running the vector loop, and for all TC iterations when runtime guards bypass the vector loop. In absence of information, each such guard could be assigned 0.5 probability, or one could be aggressively optimistic and hope vector loop is always reached. In any case this deserves a comment.

Suggesting further variable name changes for the three Orig, Unrolled, and Remainder Loops, each having a LoopEntry==LoopExit edge weight, a Backedge weight, a HeaderBlock weight, and an AverageTripCount. The actual weights are recorded as TrueVal and FalseVal of the latch branches.

Patch needs to be clang-format'ed.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4053

"is less ... than original TC" >> "is smaller than the original TC by a factor of VFxUF"

4060

OrigTakenWeight >> OrigBackedgeTakenWeight or OrigBackedgeWeight ?
OrigExitWeight >> OrigLoopExitWeight?
May help to also set OrigLoopEntryWeight = OrigLoopExitWeight?

4063

OrigBackBranchI >> OrigLoopLatchBranch ?

4066

It seems clearer to call extractProfMetadata(TrueVal, FalseVal) and then set BackedgeTaken and LoopExit/Entry weights according to if (OrigLoopLatchBranch->getSuccessor(0) == OrigLoop->getHeader()), following LoopUtil's getLoopEstimatedTripCount(). Analogously for createBranchWeights(TrueVal, FalseVal).

In any case, better rename "IsTrueBackEdge*Loop".

4080

Patch needs to be clang-format'ed

4087

Instead of providing the explanation in a comment, seems better to implement the code this way, leaving the +1 for the compiler to optimize. I.e.,

const uint64_t OrigHeaderBlockWeight = OrigBackedgeTakenWeight + OrigLoopEntryWeight;
const unit64_t OrigAverageTripCount = OrigHeaderBlockWeight / OrigLoopEntryWeight;
4094

Better rename/expand "PE".
PEIterCount >> RemainderLoopAverageTripCount?

4098

In the general context, "Vec" >> "UnrolledLoop".
VecTakenCount >> UnrolledLoopBackedgeWeight
VecFallThrough >> UnrolledLoopExitWeight and/or UnrolledLoopEntryWeight

4100

How about

if (UnrolledLoopAverageTripCount > 0) {
  UnrolledLoopEntryWeight = OrigLoopEntryWeight;
  uint64_t UnrolledLoopHeaderWeight = UnrolledLoopAverageTripCount * UnrolledLoopEntryWeight; // Analogous to computing OrigLoopAverageTripCount from Header and Entry weights above.
  UnrolledLoopBackedgeWeight = UnrolledLoopHeaderWeight - UnrolledLoopEntryWeight;
}

leaving the -1 optimization to the compiler.

4106

PETakenCount >> RemainderLoopBackedgeWeight
PEFallThroughCount >> RemainderLoopExitWeight and/or RemainderLoopEntryWeight

ebrevnov marked 9 inline comments as done.Dec 9 2019, 1:55 AM

Still think it would be better to provide this as a standalone function in Transforms/Utils/LoopUtils, for potential benefit of loop unroll (and jam) passes in addition to LV. Having agreed to ignore foldTail and requiresScalarEpilog, there's nothing vectorization-specific to do here. There's still an issue though with the fact that LV may use the scalar loop for both the remaining TC%(VF*UF) iterations when running the vector loop, and for all TC iterations when runtime guards bypass the vector loop. In absence of information, each such guard could be assigned 0.5 probability, or one could be aggressively optimistic and hope vector loop is always reached. In any case this deserves a comment.

Suggesting further variable name changes for the three Orig, Unrolled, and Remainder Loops, each having a LoopEntry==LoopExit edge weight, a Backedge weight, a HeaderBlock weight, and an AverageTripCount. The actual weights are recorded as TrueVal and FalseVal of the latch branches.

Patch needs to be clang-format'ed.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4066

Don't feel convinced. My point would be that extra variables and conditional reassignments make the code less readable. I think this is very subjective thing.

4100

That will make computations less stable to overflow. Personally I feel the way it's written today has the same level of complexity for understanding.

ebrevnov updated this revision to Diff 232778.Dec 9 2019, 1:56 AM

Addressing issues raised by Ayal.

ebrevnov updated this revision to Diff 232779.Dec 9 2019, 1:59 AM

Typo fixed.

Harbormaster completed remote builds in B42094: Diff 232779.
Ayal added inline comments.Dec 18 2019, 1:51 PM
llvm/include/llvm/Transforms/Utils/LoopUtils.h
374

The fact that OrigLoop is both the original loop containing the original profile weights, and acts as the RemainderLoop dedicated to leftover iterations, should be clarified. Alternatively, this utility can receive three loops: OrigLoop, UnrolledLoop and RemainderLoop, leaving it to the caller to decide if to pass OrigLoop also as RemainderLoop.

Would probably be clearer to start with UnrolledLoop receiving weights that reflect TC/UF iterations, and then OrigLoop which receives weights that reflect the remaining TC%UF iterations.

376

U[n]rolledLoop, several occurrences.

"\UF" >> "\p UF"

llvm/lib/Transforms/Utils/LoopUtils.cpp
1102

Worth commenting that OrigLoopEntryWeight also holds OrigLoopExitWeight, which is more clearly the weight associated with the (exit direction of the) latch branch.

1110

UnrolledBBI >> Unrolled[Loop]LatchBranch, as in OrigLoopLatchBranch.

As the names end up overflowing lines, can use Orig, Unrolled and Remainder to stand for OrigLoop, UnrolledLoop and RemainderLoop; i.e., taking "Loop" out.

1113

VecLoop >> UnrolledLoop

1123

Can drop the 'const', for consistency; these temporaries are obviously const's.

1126

Note that this is rounding down. Can add half of the denominator to the nominator before dividing in order to round more accurately; this is what getLoopEstimatedTripCount() does, but it seems to be off by 1 as it computes BackEdgeTakenWeight / LoopEntryWeight rounded to nearest, instead of HeaderBlockWeight / LoopEntryWeight rounded to nearest...

Simply call OrigAverageTripCount = getLoopEstimatedTripCount(OrigLoop)?

Perhaps having a setLoopEstimatedTripCount(Loop, EstimatedTripCount, EstimatedEntryWeight) would help fold the identical treatment of UnrolledLoop and RemainderLoop into one function, which also takes care of figuring out the True/False vs. Backedge/Exit directions?

1128

U[n]rolledAverageTripCount

1138

Seems slightly more logical to first set UnrolledLoopEntryWeight, and then using it set UnrolledLoopBackedgeWeight.

1147

ditto

1160

(This actually replaces the old profile metadata with the new one.)

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
118

Is this include still needed here?

3486

Comment below should start with a short sentence explaining that profile weights associated with the original loop are now distributed among the vector and scalar loops.

ebrevnov marked 16 inline comments as done.Dec 26 2019, 3:29 AM
ebrevnov added inline comments.
llvm/lib/Transforms/Utils/LoopUtils.cpp
1102

IMHO instead of trying to clarify with a comment we better find self descriptive name for such a simple and commonly used thing. Strictly speaking OrigLoopEntryWeight != OrigLoopExitWeight. Do you find OrigBackEdgeExitWeight good enough?

1110

Removed "Loop" from most names to make them a little shorter.

1126

This "off by 1" stopped me from using it in the first place since that could be important in some cases. OK, let's reuse getLoopEstimatedTripCount. To be able to do that there are some changes to getLoopEstimatedTripCount.

ebrevnov updated this revision to Diff 235335.Dec 26 2019, 3:30 AM
ebrevnov marked 3 inline comments as done.

Updated as requested.

Hi @Ayal. Thanks for you input. I fixed all places as you suggested. Please check.

Ayal added a comment.Dec 26 2019, 8:44 AM

Thanks for making all the changes! More comments inline.

llvm/lib/Transforms/Utils/LoopUtils.cpp
693–694

Comment what this new function is for. Rename (see below)? Retain "Support loops ..." comment, added in D64553?

730

dyn_cast >> cast

Perhaps update above function to do here something like
BranchInst *LatchBR = getExpectedExitLoopLatchBranch(L)
checking if it returned nullptr or not?

742

Thanks for taking care of this fix to improve accuracy of estimated TC!
But doing so deserves a separate patch, and tests. Note that it also effects loop unrolling, i.e., its effects are beyond LV. This part can be introduced either before or after the part that teaches LV to maintain profiling info.

763

ditto (dyn_cast >> cast, ...)

764

Better check similar to above if (LatchBR->getSuccessor(0) != L->getHeader())

1098

"the \p UnrolledLoop \p RemainderLoop" >>
"\p UnrolledLoop and \p RemainderLoop"

1102

May also be worthwhile asserting that UF is positive (or greater than 1?)

1102

IMHO instead of trying to clarify with a comment we better find self descriptive name for such a simple and commonly used thing.

Definitely agree with (the preference of) finding self descriptive variable names.

Strictly speaking OrigLoopEntryWeight != OrigLoopExitWeight"

How so, given that OrigLoop has a single-entry, and an "expected" single-exit: there may be other "side/deopt" exits, but these are expected to have zero weight. E.g., when computing LoopHeaderWeight above, adding LoopEntryWeight (instead of LoopExitWeight) to BackEdgeTakenWeight seems more logical.

Do you find OrigBackEdgeExitWeight good enough?

Strictly speaking, a BackEdge cannot exit: it is an edge going from a latch block to a header block, and "BackEdgeTaken" is the number of times this edge is taken=traversed, which equals the number of times its latch branch is "taken" rather than "falls-thru" (if it's true direction points to the header). Perhaps LoopEntryExitWeight or LoopInvocationWeight could be used instead/in addition.

1103

are expected to be distinct

1126

This "off by 1" stopped me from using it in the first place since that could be important in some cases. OK, let's reuse getLoopEstimatedTripCount. To be able to do that there are some changes to getLoopEstimatedTripCount.

Indeed, best fix and reuse, in a dedicated patch as raised above, thereby isolating impact on such "some cases".

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3495

"is not taken into account as unlikely case" >>
"is ignored, assigning all the weight to the vector loop, optimistically"

llvm/test/Transforms/LoopVectorize/check-prof-info.ll
7

Tests targeting x86 need to reside in LoopVectorize/X86

ebrevnov updated this revision to Diff 235579.Dec 30 2019, 4:31 AM
ebrevnov marked 15 inline comments as done.

One more round of updates.

ebrevnov added inline comments.Dec 30 2019, 11:47 PM
llvm/lib/Transforms/Utils/LoopUtils.cpp
693–694

The comment is related to get/setLoopEstimatedTripCount and still there...

730

I was thinking about that in the first place but didn't come up with a "good" enough name. I can go with that name if you like :-)

742

Ok.

1102

I think we better support 1 which could be used in some corner cases....

1102

How so, given that OrigLoop has a single-entry, and an "expected" single-exit: there may be other "side/deopt" exits, but these are expected to have zero weight. E.g., when computing LoopHeaderWeight above, adding LoopEntryWeight (instead of LoopExitWeight) to BackEdgeTakenWeight seems more logical.

For infinite loop entry is 1 while exit is 0 :-). I understand this is extreme we will never meet but still....

Strictly speaking, a BackEdge cannot exit: it is an edge going from a latch block to a header block, and "BackEdgeTaken" is the number of times this edge is taken=traversed, which equals the number of times its latch branch is "taken" rather than "falls-thru" (if it's true direction points to the header).

That's why I used TakenCount and FallThroughCount in the very first version what perfectly matches your description. I don't feel we are getting any better names with more iterations.... Perhaps LatchCycleWeight - number of times we go to loop header from the latch and LatchExitWeight - number of times we go to loop exit from the latch?

Perhaps LoopEntryExitWeight or LoopInvocationWeight could be used instead/in addition.

I think LoopEntryExitWeight may be confusing....
I think it makes sense to use EstimatedLoopInvocationWeight in conjunction with EstimatedTripCount as parameters to get/setEstimatedTripCount interface while LatchCycleWeight and LatchExitWeight in the implementation as they are little-bit more low level.

Ayal accepted this revision.Jan 15 2020, 3:59 PM

This looks good to me, thanks!

llvm/lib/Transforms/Utils/LoopUtils.cpp
779

The last part could call fixupBranchWeights() if moved here from Transforms/Utils/LoopUnrollPeel.cpp

llvm/test/Transforms/LoopVectorize/tripcount.ll
211

Following this, to clarify:
original loop has latchExitWeight=10 and backedgeTakenWeight=10,000, therefore estimatedBackedgeTakenCount=1,000 and estimatedTripCount=1,001.
Vectorizing by 4 produces
estimatedTripCounts of 1,001/4=250 and 1,001%4=1 for vectorized and remainder loops, respectively, therefore their
estimatedBackedgeTakenCounts are 249 and 0, and so the weights recorded with loop invocation weights of 10 are the above {10, 2490} and {10, 0}.

ebrevnov marked 2 inline comments as done.Jan 15 2020, 7:17 PM
ebrevnov added inline comments.
llvm/lib/Transforms/Utils/LoopUtils.cpp
779

That would require to change implementation for fixupBranchWeights since it disregards to update when back edge taken count is zero.

llvm/test/Transforms/LoopVectorize/tripcount.ll
211

I will add this text to the test. I that what you wanted (just not sure :-))?

Ayal added inline comments.Jan 16 2020, 11:55 PM
llvm/lib/Transforms/Utils/LoopUtils.cpp
779

Agreed. Regarding disregarding to update when backedge taken weight is zero, note that in fixupBranchWeights(), BackedgeTakenWeight is called FallThroughWeight(?) "The weight of the edge from Latch to Header", and that

// FallThroughWeight is 0 means that there is no branch weights on original
// latch block or estimated trip count is zero.

Regarding the first meaning of 0, whoever calls fixupBranchWeights() should do so only if there were such weights on the original latch block, similar to the caller of setLoopEstimatedTripCount(). Regarding the second meaning of 0, it seem fixupBranchWeights() suffers from same +1 issue: estimating trip count to be zero when backedge taken weight is zero.

It would be good to fix and centralize the support for updating weights of loops, but such refactoring can be done as a separate follow-up patch, after landing this (accepted) patch.

This revision was automatically updated to reflect the committed changes.