This is an archive of the discontinued LLVM Phabricator instance.

LoopUnrollRuntime: Add weights to all branches
ClosedPublic

Authored by MatzeB on Aug 23 2023, 10:35 AM.

Details

Summary

Make sure every conditional branch constructed by LoopUnrollRuntime code sets branch weights.

  • Add new 1:127 weights for the conditional jumps checking whether the whole (unrolled) loop should be skipped in the generated prolog or epilog code.
  • Remove updateLatchBranchWeightsForRemainderLoop function and just add weights immediately when constructing the relevant branches. This leads to simpler code and makes the code more obvious as every call to CreateCondBr now has a BranchWeights parameter.
  • Rework formula for epilogue latch weights, to assume equal distribution of remainders and remove assert (as I was able to reach this code when forcing small unroll factors on the commandline).

Diff Detail

Event Timeline

MatzeB created this revision.Aug 23 2023, 10:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 23 2023, 10:35 AM
MatzeB requested review of this revision.Aug 23 2023, 10:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 23 2023, 10:35 AM
MatzeB retitled this revision from LoopUnrollRuntime: Add branch weights to newly created branches to LoopUnrollRuntime: Add weights to all branches.Aug 23 2023, 10:38 AM
MatzeB added subscribers: spatel, lebedev.ri.EditedAug 23 2023, 11:01 AM

I guess we may need a discussion about how to express these "nearly always taken" / "nearly never taken" branch weights. The hardcoded 1:127 ratio in the current patch is pretty arbitrary.

Looking around I see LikelyBranchWeight in llvm/Transforms/Scalar/LowerExpectIntrinsic.cpp but it's private to the pass and a warning advises to use TargetTransformInfo::getPredictableBranchThreshold(). Intuitively it feels wrong to me to have branch_weights change based on the selected target here and I am not sure I follow why LowerExpectIntrinsic gets to be an exception and not use getPredictableBranchThreshold() then...

CC @lebedev.ri @spatel

Some history in

Maybe we should expose the "LikelyBranchWeight" of LowerExpectIntrinsic.cpp in an internal LLVM header? So frontends are still forced to use the llvm.expect abstractions while LLVM-internal code can do the quicker thing and add likely branch weights directly...

mtrofin added a reviewer: xur.Aug 24 2023, 9:44 AM
mtrofin accepted this revision.Aug 24 2023, 9:48 AM

nits.

llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
797–803

should this logic be factored in a getBranchWeights(Latch, Count) - something like that?

This revision is now accepted and ready to land.Aug 24 2023, 9:48 AM
mtrofin added inline comments.Sep 1 2023, 7:48 AM
llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
801

one thought about the 1:127 ratio: how about we had a MDBuilder::createUnrolledLoopSkipWeights() doing the setting - more readable, API captures the intent, etc?

hoy added inline comments.Sep 1 2023, 9:54 AM
llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
801

Can this be computed based on the loop preheader count and the main loop trip count? IIUC, the block execution count of the preheader indicates how many times the loop is entered from outside the loop. Thus that the loop trip count divided by the preheader count indicates an average trip count per loop execution. If that is greater than the unrolling factor, then the unrolled loop should be always executed?

MatzeB added inline comments.Sep 1 2023, 12:53 PM
llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
797–803

and put it where? I find that having the branch weight decisions right next to the code creating the branch helps readability of the code as you don't have to jump around.

801

how about we had a MDBuilder::createUnrolledLoopSkipWeights()

As above I think this hurts readability as it moves the logic away into another file. Also if we decide this being the preferred method, then I'd rather see a mass-refactoring of relevant places rather than starting with one-offs in diffs.

Can this be computed based on the loop preheader count and the main loop trip count?

I don't see how. The old preheader (if it existed at all) will give you the ratio of zero and non-zero trip counts. But this here needs to catch the cases of trip-counts being smaller than the unroll factor...

mtrofin added inline comments.Sep 1 2023, 1:13 PM
llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
801

how about we had a MDBuilder::createUnrolledLoopSkipWeights()

As above I think this hurts readability as it moves the logic away into another file. Also if we decide this being the preferred method, then I'd rather see a mass-refactoring of relevant places rather than starting with one-offs in diffs.

I can see your way for the above, but here, the way I see it is that the choice of numbers and the intent behind them is not immediately obvious, so an API would make that clear. At minimum the values could be a constant that's reused? (maybe the ArrayRef overload of createBranchWeights would take that constant or something like that)

If this set of weights is this way elsewhere, doing a refactoring after makes sense, but (maybe I searched superficially) can't seem to find them other than in your recent patches on this topic, that's why I was thinking this may be the opportune moment - wdyt?

hoy added inline comments.Sep 1 2023, 1:49 PM
llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
801

I don't see how. The old preheader (if it existed at all) will give you the ratio of zero and non-zero trip counts. But this here needs to catch the cases of trip-counts being smaller than the unroll factor...

The old preheader can be just some block dominating the loop header, for the sake of block execution count.

Right, the branch is comparing trip count per each loop entry with the unroll factor. But to simulate that probability at compile time, I feel we can use the aggregated loop trip count (i.e., profile count) divided by the preheader count to compare with the unroll factor. Let me know if this feels wrong.

hoy added inline comments.Sep 1 2023, 3:19 PM
llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
801

Talked offline. The average trip count is already used to check if loop unroll is beneficial. Here we are pretty sure the unrolled loop body is going to be run, so giving a 99.9...% ratio is reasonable.

MatzeB updated this revision to Diff 555524.Sep 1 2023, 3:52 PM

factor out branch weight constants.

MatzeB marked 2 inline comments as done.Sep 1 2023, 3:53 PM
wenlei added inline comments.Sep 4 2023, 2:35 PM
llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
61

nit: in the comment also explain why it's unlikely for loop to not enter unrolled version at all. Like does the unroll only happen with known trip count (which is bigger than unroll factor)?

62

It's good to have constants extracted out, but when we do that, we can try to make it meaningful when someone just look at the constants without looking at the uses.

If {1,127} means *low* probability for not entering unrolled loop at all, it'd be confusing for the same {1,127} to also mean *high* probability for epilog loop to be executed.

Easiest way to fix is to add comment for each individual weight..

315

nit: you mean [0, count)?

405–408

I know this is just refactoring from updateLatchBranchWeightsForRemainderLoop, but I'm wondering if the assumption is reasonable? If we assume linear distribution for loop trip counts, then the average trip count for the remainder should be Count / 2, rather than Count - 1?

MatzeB added inline comments.Sep 11 2023, 11:04 AM
llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
61

in the comment also explain why it's unlikely for loop to not enter unrolled version at all. Like does the unroll only happen with known trip count (which is bigger than unroll factor)?

The header deals with the amount of iterations not fitting into the unroll factor. i.e. if the loop requires 43 iterations and you have an unroll factor of 4, then the header will do 3 iterations and we enter the unrolled loop to process the remaining 40 iterations (that are now 10 unrolled ones).

However I think this is the wrong place to explain how this pass works. It's just a constant used in the algorithm. I'd rather not try to write explanations of how the algorithm works here, we should have that at the beginning of the file, in the functions performing the transformations and policy questions you hint at are in LoopUnrollPass.cpp this file here is mostly about mechanisms.

I have a feeling what you are really asking about here is some reasoning why "127" is a good value. And honestly I don't have a deeper analysis behind this, except for looking at some small toy examples. This should be clearly better than the default 50:50 split we end up when not adding anything, so I hope we can leave tuning for later (if necessary).

62

If {1,127} means *low* probability for not entering unrolled loop at all, it'd be confusing for the same {1,127} to also mean *high* probability for epilog loop to be executed.

This is about entering a loop epilogue, not the unrolled loop. And you are reading this correctly in that it should be a *low* probability.

The comment already states this. The code either produces a prologue or an epilogue to deal with the extra iterations that don't fit the unrolled loop. Though again this seems like the wrong place to explain the intricacies of this pass, it's just some constants...

MatzeB added inline comments.Sep 11 2023, 11:16 AM
llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
405–408

CC @hoy who originally added this (D83187).

I think it would be (Count / 2) - 1 for linear distribution. Guess I can change it as part of this patch...

MatzeB added inline comments.Sep 11 2023, 11:56 AM
llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
405–408

Given there is a pre-header to rule out zero-trip counts and this being a loop backedge, the average should be (Count - 2) / 2.

wenlei added inline comments.Sep 11 2023, 1:24 PM
llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
61

The header deals with the amount of iterations not fitting into the unroll factor. i.e. if the loop requires 43 iterations and you have an unroll factor of 4, then the header will do 3 iterations and we enter the unrolled loop to process the remaining 40 iterations (that are now 10 unrolled ones).

This is not answering the question, instead this is explaining how the pass works.

However I think this is the wrong place to explain how this pass works.

We do not need to explain how the pass works in order to explain why specific values are chosen. :) We can assume some knowledge of how the pass works from the readers.

I have a feeling what you are really asking about here is some reasoning why "127" is a good value.

Partially. But what I'm really asking is to put an explanation for the value you chose in the comment. Just saying this is generally unlikely, and it's a ball park estimate, not yet tuned can also be helpful (so people know they have more freedom to change it).

I actually think we should put something like Hongtao's comment here.

The average trip count is already used to check if loop unroll is beneficial. Here we are pretty sure the unrolled loop body is going to be run, so giving a 99.9...% ratio is reasonable.

62

This is about entering a loop epilogue, not the unrolled loop.

For the unrolled loop, I was referring to the UnrolledLoopHeaderWeights above which shares the same value {1,127}. Sorry for causing confusion.

And you are reading this correctly in that it should be a *low* probability.

No, I read it as it's a high probability to enter loop epilog. Given that we have 1 / UF chance of loop trip count being exact multiples of UF, we enter loop epilog with a probability of (UF - 1) / UF. Am I missing something here?

And btw, if I'm missing something obvious here, the answer doesn't need to go into comments.

Though again this seems like the wrong place to explain the intricacies of this pass, it's just some constants...

I guess I care less where this is explained. But I think there needs to be an explanation for the values chosen somewhere. Currently there is no explanation.

However, given the constants are now extracted out, I think the most natural way is to have the explanation go along with the constants/values.

We don't need explain the intricacies of this pass, but IMHO we need enough context for people to understand why these values were chosen, and how confident original author was about the chosen value. For such explanation, we can assume certain knowledge of the pass. Staring at some values and scratching our heads just isn't a good experience, so I'd err on the side of over-communicate when it comes to chosen constants.

MatzeB updated this revision to Diff 556479.Sep 11 2023, 1:29 PM

Rework formula for prolog/epilog loop-backedge weights as suggested by @wenlei .

MatzeB marked 2 inline comments as done.Sep 11 2023, 1:32 PM
MatzeB added inline comments.Sep 11 2023, 1:45 PM
llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
62

Sorry if this is confusing, the test here is whether the loop trip count is so small that we skip the unrolled loop and enter the epilogue immediatey. Trying to make the comment more clear now...

MatzeB updated this revision to Diff 556480.Sep 11 2023, 1:46 PM

Tweak comments for weight constants.

MatzeB edited the summary of this revision. (Show Details)Sep 11 2023, 1:49 PM
wenlei accepted this revision.Sep 11 2023, 1:58 PM

lgtm, thanks!

llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
405–408

Makes sense. I didn't consider the fact that it's a backedge.

This revision was landed with ongoing or failed builds.Sep 11 2023, 2:26 PM
This revision was automatically updated to reflect the committed changes.
MatzeB marked 3 inline comments as done.