Vectorized loop processes VFxUF number of elements in one iteration thus total number of iterations decreases proportionally. In addition epilog loop may not have more than VFxUF - 1 iterations. This patch updates profile information accordingly.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
LGTM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
4055 | Nit: VFxUF - 1 | |
4061–4063 | Style: it is usually advised to turn such conditions into early exits, it would reduce required indentation and slightly improve readability. | |
4082 | Maybe introduce a new variable for this value (like EpilogueTakenCount)? Right now it's a bit surprising that OrigSomething is being changed. Same goes to OrigFallThroughCount. |
I realized that current implementation has a flaw and and we should take into account that actual number of iterations is one greater than back edge taken count. In addition I believe that current structuring of calculations is easier for understanding.
Adding a few comments. Would be good to generalize and apply also to loop unroll (and jam).
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
4064 | OrigFallThroughCount can still be either the exit count or the continue-to-next-iteration count, according to the code below. Wait to test if its zero until we know what it stands for? | |
4076 | Better use distinct names, e.g., OrigExitCount and OrigBackedgeTakenCount, than continue to call them Taken and FallThrough. Perhaps use Weight instead of Count, to denote total profile frequencies, as the latter is used elsewhere to denote the actual per-invocation TripCount. | |
4079 | bel[l]ow | |
4081 | How about "OrigAverageTripCount"? Explanation about its computation: | |
4083 | How about VecAverageTripCount = OrigAverageTripCount / (VF * UF); | |
4087 | Just to clarify, maintaining branch frequencies through optimizations is best-effort and imprecise - a total weight that does not divide VF*UF implies that the trip count of at-least one invocation did not divide VF*UF, not necessarily all of them; w/o considering also the distribution of trip counts in addition to their sum. Setting PRIterCount = 0 and VecAverageTipCount = round(OrigAverageTripCount / (VF*UV)) when Cost->foldTailByMasking() is probably the best that can be done. The former is redundant given that it applies to dead code, and the latter should perhaps apply to all cases, in general. | |
4091 | There's also the special case of requiresScalarEpiloque() where 0 < PEIterCount <= VF*UF for each invocation of the loop, and hence the average is also strictly positive FWIW. But best keep the approximation general instead of trying to improve it, given general lack of information. | |
4098 | This assumes the number of times the vector loop will be reached is equal to the number of times the original scalar loop was reached (OrigFallThrougCount). This holds is Cost->foldTailByMasking(), but otherwise invocations whose trip count < VF*UF will bypass the vector loop (and also == VF*UF if requireScalarEpilogue()), plus other run time guards. | |
4107 | Similar to above comment, invocations whose trip count divides VF*UF will bypass the scalar remainder loop (w/o foldTailByMasking nor requireScalarEpilogue), so in general PEFallThroughCount <= OrigFallThroughCount. | |
llvm/test/Transforms/LoopVectorize/check-prof-info.ll | ||
4 | May want to also check with UF>1. |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
4064 | Good catch. Thanks! | |
4076 | I fixed names. But I don't see reasons to use different variables here (if this is what you meant) | |
4079 | fixed | |
4081 | Ok. Turned your explanation to a comment. | |
4091 | Agree. Let me remove this special case then. | |
4098 | Please note that VecFallThrough is zero initially and set to OrigFallThrougCount only if vector loop is expected to be executed (VecIterCount > 0) | |
4107 | Same explanation as for the above. | |
llvm/test/Transforms/LoopVectorize/check-prof-info.ll | ||
4 | I replaced masked case since we don't do anything special for it now. |
Still think it would be better to provide this as a standalone function in Transforms/Utils/LoopUtils, for potential benefit of loop unroll (and jam) passes in addition to LV. Having agreed to ignore foldTail and requiresScalarEpilog, there's nothing vectorization-specific to do here. There's still an issue though with the fact that LV may use the scalar loop for both the remaining TC%(VF*UF) iterations when running the vector loop, and for all TC iterations when runtime guards bypass the vector loop. In absence of information, each such guard could be assigned 0.5 probability, or one could be aggressively optimistic and hope vector loop is always reached. In any case this deserves a comment.
Suggesting further variable name changes for the three Orig, Unrolled, and Remainder Loops, each having a LoopEntry==LoopExit edge weight, a Backedge weight, a HeaderBlock weight, and an AverageTripCount. The actual weights are recorded as TrueVal and FalseVal of the latch branches.
Patch needs to be clang-format'ed.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
4053 | "is less ... than original TC" >> "is smaller than the original TC by a factor of VFxUF" | |
4060 | OrigTakenWeight >> OrigBackedgeTakenWeight or OrigBackedgeWeight ? | |
4063 | OrigBackBranchI >> OrigLoopLatchBranch ? | |
4066 | It seems clearer to call extractProfMetadata(TrueVal, FalseVal) and then set BackedgeTaken and LoopExit/Entry weights according to if (OrigLoopLatchBranch->getSuccessor(0) == OrigLoop->getHeader()), following LoopUtil's getLoopEstimatedTripCount(). Analogously for createBranchWeights(TrueVal, FalseVal). In any case, better rename "IsTrueBackEdge*Loop". | |
4080 | Patch needs to be clang-format'ed | |
4087 | Instead of providing the explanation in a comment, seems better to implement the code this way, leaving the +1 for the compiler to optimize. I.e., const uint64_t OrigHeaderBlockWeight = OrigBackedgeTakenWeight + OrigLoopEntryWeight; const unit64_t OrigAverageTripCount = OrigHeaderBlockWeight / OrigLoopEntryWeight; | |
4094 | Better rename/expand "PE". | |
4098 | In the general context, "Vec" >> "UnrolledLoop". | |
4100 | How about if (UnrolledLoopAverageTripCount > 0) { UnrolledLoopEntryWeight = OrigLoopEntryWeight; uint64_t UnrolledLoopHeaderWeight = UnrolledLoopAverageTripCount * UnrolledLoopEntryWeight; // Analogous to computing OrigLoopAverageTripCount from Header and Entry weights above. UnrolledLoopBackedgeWeight = UnrolledLoopHeaderWeight - UnrolledLoopEntryWeight; } leaving the -1 optimization to the compiler. | |
4106 | PETakenCount >> RemainderLoopBackedgeWeight |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
4066 | Don't feel convinced. My point would be that extra variables and conditional reassignments make the code less readable. I think this is very subjective thing. | |
4100 | That will make computations less stable to overflow. Personally I feel the way it's written today has the same level of complexity for understanding. |
llvm/include/llvm/Transforms/Utils/LoopUtils.h | ||
---|---|---|
374 | The fact that OrigLoop is both the original loop containing the original profile weights, and acts as the RemainderLoop dedicated to leftover iterations, should be clarified. Alternatively, this utility can receive three loops: OrigLoop, UnrolledLoop and RemainderLoop, leaving it to the caller to decide if to pass OrigLoop also as RemainderLoop. Would probably be clearer to start with UnrolledLoop receiving weights that reflect TC/UF iterations, and then OrigLoop which receives weights that reflect the remaining TC%UF iterations. | |
376 | U[n]rolledLoop, several occurrences. "\UF" >> "\p UF" | |
llvm/lib/Transforms/Utils/LoopUtils.cpp | ||
1102 | Worth commenting that OrigLoopEntryWeight also holds OrigLoopExitWeight, which is more clearly the weight associated with the (exit direction of the) latch branch. | |
1110 | UnrolledBBI >> Unrolled[Loop]LatchBranch, as in OrigLoopLatchBranch. As the names end up overflowing lines, can use Orig, Unrolled and Remainder to stand for OrigLoop, UnrolledLoop and RemainderLoop; i.e., taking "Loop" out. | |
1113 | VecLoop >> UnrolledLoop | |
1123 | Can drop the 'const', for consistency; these temporaries are obviously const's. | |
1126 | Note that this is rounding down. Can add half of the denominator to the nominator before dividing in order to round more accurately; this is what getLoopEstimatedTripCount() does, but it seems to be off by 1 as it computes BackEdgeTakenWeight / LoopEntryWeight rounded to nearest, instead of HeaderBlockWeight / LoopEntryWeight rounded to nearest... Simply call OrigAverageTripCount = getLoopEstimatedTripCount(OrigLoop)? Perhaps having a setLoopEstimatedTripCount(Loop, EstimatedTripCount, EstimatedEntryWeight) would help fold the identical treatment of UnrolledLoop and RemainderLoop into one function, which also takes care of figuring out the True/False vs. Backedge/Exit directions? | |
1128 | U[n]rolledAverageTripCount | |
1138 | Seems slightly more logical to first set UnrolledLoopEntryWeight, and then using it set UnrolledLoopBackedgeWeight. | |
1147 | ditto | |
1160 | (This actually replaces the old profile metadata with the new one.) | |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
118 | Is this include still needed here? | |
3486 | Comment below should start with a short sentence explaining that profile weights associated with the original loop are now distributed among the vector and scalar loops. |
llvm/lib/Transforms/Utils/LoopUtils.cpp | ||
---|---|---|
1102 | IMHO instead of trying to clarify with a comment we better find self descriptive name for such a simple and commonly used thing. Strictly speaking OrigLoopEntryWeight != OrigLoopExitWeight. Do you find OrigBackEdgeExitWeight good enough? | |
1110 | Removed "Loop" from most names to make them a little shorter. | |
1126 | This "off by 1" stopped me from using it in the first place since that could be important in some cases. OK, let's reuse getLoopEstimatedTripCount. To be able to do that there are some changes to getLoopEstimatedTripCount. |
Thanks for making all the changes! More comments inline.
llvm/lib/Transforms/Utils/LoopUtils.cpp | ||
---|---|---|
693–694 | Comment what this new function is for. Rename (see below)? Retain "Support loops ..." comment, added in D64553? | |
730 | dyn_cast >> cast Perhaps update above function to do here something like | |
742 | Thanks for taking care of this fix to improve accuracy of estimated TC! | |
763 | ditto (dyn_cast >> cast, ...) | |
764 | Better check similar to above if (LatchBR->getSuccessor(0) != L->getHeader()) | |
1098 | "the \p UnrolledLoop \p RemainderLoop" >> | |
1102 | May also be worthwhile asserting that UF is positive (or greater than 1?) | |
1102 |
Definitely agree with (the preference of) finding self descriptive variable names.
How so, given that OrigLoop has a single-entry, and an "expected" single-exit: there may be other "side/deopt" exits, but these are expected to have zero weight. E.g., when computing LoopHeaderWeight above, adding LoopEntryWeight (instead of LoopExitWeight) to BackEdgeTakenWeight seems more logical.
Strictly speaking, a BackEdge cannot exit: it is an edge going from a latch block to a header block, and "BackEdgeTaken" is the number of times this edge is taken=traversed, which equals the number of times its latch branch is "taken" rather than "falls-thru" (if it's true direction points to the header). Perhaps LoopEntryExitWeight or LoopInvocationWeight could be used instead/in addition. | |
1103 | are expected to be distinct | |
1126 |
Indeed, best fix and reuse, in a dedicated patch as raised above, thereby isolating impact on such "some cases". | |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
3495 | "is not taken into account as unlikely case" >> | |
llvm/test/Transforms/LoopVectorize/check-prof-info.ll | ||
7 | Tests targeting x86 need to reside in LoopVectorize/X86 |
llvm/lib/Transforms/Utils/LoopUtils.cpp | ||
---|---|---|
693–694 | The comment is related to get/setLoopEstimatedTripCount and still there... | |
730 | I was thinking about that in the first place but didn't come up with a "good" enough name. I can go with that name if you like :-) | |
742 | Ok. | |
1102 | I think we better support 1 which could be used in some corner cases.... | |
1102 |
For infinite loop entry is 1 while exit is 0 :-). I understand this is extreme we will never meet but still....
That's why I used TakenCount and FallThroughCount in the very first version what perfectly matches your description. I don't feel we are getting any better names with more iterations.... Perhaps LatchCycleWeight - number of times we go to loop header from the latch and LatchExitWeight - number of times we go to loop exit from the latch?
I think LoopEntryExitWeight may be confusing.... |
This looks good to me, thanks!
llvm/lib/Transforms/Utils/LoopUtils.cpp | ||
---|---|---|
779 | The last part could call fixupBranchWeights() if moved here from Transforms/Utils/LoopUnrollPeel.cpp | |
llvm/test/Transforms/LoopVectorize/tripcount.ll | ||
211 | Following this, to clarify: |
llvm/lib/Transforms/Utils/LoopUtils.cpp | ||
---|---|---|
779 | That would require to change implementation for fixupBranchWeights since it disregards to update when back edge taken count is zero. | |
llvm/test/Transforms/LoopVectorize/tripcount.ll | ||
211 | I will add this text to the test. I that what you wanted (just not sure :-))? |
llvm/lib/Transforms/Utils/LoopUtils.cpp | ||
---|---|---|
779 | Agreed. Regarding disregarding to update when backedge taken weight is zero, note that in fixupBranchWeights(), BackedgeTakenWeight is called FallThroughWeight(?) "The weight of the edge from Latch to Header", and that // FallThroughWeight is 0 means that there is no branch weights on original // latch block or estimated trip count is zero. Regarding the first meaning of 0, whoever calls fixupBranchWeights() should do so only if there were such weights on the original latch block, similar to the caller of setLoopEstimatedTripCount(). Regarding the second meaning of 0, it seem fixupBranchWeights() suffers from same +1 issue: estimating trip count to be zero when backedge taken weight is zero. It would be good to fix and centralize the support for updating weights of loops, but such refactoring can be done as a separate follow-up patch, after landing this (accepted) patch. |
The fact that OrigLoop is both the original loop containing the original profile weights, and acts as the RemainderLoop dedicated to leftover iterations, should be clarified. Alternatively, this utility can receive three loops: OrigLoop, UnrolledLoop and RemainderLoop, leaving it to the caller to decide if to pass OrigLoop also as RemainderLoop.
Would probably be clearer to start with UnrolledLoop receiving weights that reflect TC/UF iterations, and then OrigLoop which receives weights that reflect the remaining TC%UF iterations.