This is an archive of the discontinued LLVM Phabricator instance.

Fix branch weight in FoldCondBranchOnValueKnownInPredecessor pass in SimplifyCFG
Needs Review · Public

Authored by LukeZhuang on Aug 5 2022, 1:19 PM.

Details

Summary

The patch is intended to fix this test case:

extern void bar();
void foo(char a, char b, char c) {
  char aa = 2 * a, bb = 2 * b, cc = 2 * c;
  if (__builtin_expect_with_probability(((aa == 2) || (bb == 2) || (cc == 0)), 1, 0)) {
    bar();
  }
}

The user defines the probability of calling function bar as 0%.

Before the second SimplifyCFG pass, the LLVM IR is as follows:

define dso_local void @foo(i8 noundef signext %0, i8 noundef signext %1, i8 noundef signext %2) local_unnamed_addr #0 {
  %4 = and i8 %0, 127
  %5 = icmp eq i8 %4, 1
  %6 = and i8 %1, 127
  %7 = icmp eq i8 %6, 1
  %8 = or i1 %5, %7
  br i1 %8, label %12, label %9

9:                                                ; preds = %3
  %10 = and i8 %2, 127
  %11 = icmp eq i8 %10, 0
  br label %12

12:                                               ; preds = %9, %3
  %13 = phi i1 [ true, %3 ], [ %11, %9 ]
  br i1 %13, label %14, label %15, !prof !5

14:                                               ; preds = %12
  call void (...) @bar() #2
  br label %15

15:                                               ; preds = %14, %12
  ret void
}

!5 = !{!"branch_weights", i32 1, i32 2147483647}

Here, the probability of calling bar() is still effectively 0%: block 12 is always executed, and the probability of taking the edge from block 12 to block 14 is 1/(1 + 2147483647), which is essentially 0%.

Then, after FoldCondBranchOnValueKnownInPredecessor, the entry block branches directly to the block containing the call to bar() (block 14 above, renumbered to block 12 below), and the IR becomes:

define dso_local void @foo(i8 noundef signext %0, i8 noundef signext %1, i8 noundef signext %2) local_unnamed_addr #0 {
  %4 = and i8 %0, 127
  %5 = icmp eq i8 %4, 1
  %6 = and i8 %1, 127
  %7 = icmp eq i8 %6, 1
  %8 = or i1 %5, %7
  br i1 %8, label %12, label %9

9:                                                ; preds = %3
  %10 = and i8 %2, 127
  %11 = icmp eq i8 %10, 0
  br i1 %11, label %12, label %13, !prof !5

12:                                               ; preds = %3, %9
  call void (...) @bar() #2
  br label %13

13:                                               ; preds = %12, %9
  ret void
}

!5 = !{!"branch_weights", i32 1, i32 2147483647}

Both the entry block and block 9 end with a BranchInst targeting block 12, so performBranchToCommonDestFolding merges the two branches and recalculates the branch weights.

So the final IR is as follows:

define dso_local void @foo(i8 noundef signext %0, i8 noundef signext %1, i8 noundef signext %2) local_unnamed_addr #0 {
  %4 = and i8 %0, 127
  %5 = icmp eq i8 %4, 1
  %6 = and i8 %1, 127
  %7 = icmp eq i8 %6, 1
  %8 = or i1 %5, %7
  %9 = and i8 %2, 127
  %10 = icmp eq i8 %9, 0
  %11 = or i1 %8, %10
  br i1 %11, label %12, label %13, !prof !5

12:                                               ; preds = %3
  call void (...) @bar() #2
  br label %13

13:                                               ; preds = %3, %12
  ret void
}

!5 = !{!"branch_weights", i32 -2147483647, i32 2147483647}

Now the probability of calling bar() is just over 50%, which is very different from what the user specified.
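To make the regression concrete, here is a small standalone calculation (not part of the patch) of the probability implied by the final metadata; note that the i32 weight -2147483647 is read back as an unsigned value:

#include <cstdint>
#include <cstdio>

int main() {
  // Branch weights are stored as i32 but interpreted as unsigned.
  uint32_t Taken = static_cast<uint32_t>(-2147483647);   // 2147483649
  uint32_t NotTaken = 2147483647u;
  double P = static_cast<double>(Taken) /
             (static_cast<double>(Taken) + static_cast<double>(NotTaken));
  printf("P(call bar) = %.10f\n", P);  // ~0.5000000005 instead of ~0
  return 0;
}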

We believe the issue is that the branch weights are not updated in the FoldCondBranchOnValueKnownInPredecessor pass. The optimization this pass performs can be reduced to the following case:

before:
  PredBB  other2
       \  /
        \/
other1  BB
     \  /\
      \/  \
 RealDest  other3

The block execution frequency of RealDest here can be represented as:

Formula 1:
  BlockFreq(RealDest)
= BlockFreq(BB) * P(BB->RealDest) + other1
= (BlockFreq(PredBB) * P(PredBB->BB) + other2) * P(BB->RealDest) + other1
= BlockFreq(PredBB) * P(PredBB->BB) * P(BB->RealDest) + other2 * P(BB->RealDest) + other1

in which P(A->B) represents the probability of going from block A to block B.

Then after the pass, the graph is updated to:

after:
  PredBB   other2
       |     /
       |    /
other1 |   BB
     \ |   /\
      \|  /  \
   RealDest    other3

PredBB is redirected to RealDest because, when the condition comes from PredBB, the BranchInst in BB always branches to RealDest.
The new block execution frequency of RealDest is:

Formula 2:
  BlockFreq(RealDest)
= BlockFreq(PredBB) * newP(PredBB->RealDest) + BlockFreq(BB) * P(BB->RealDest) + other1

in which BlockFreq(BB) is now equal to other2, since PredBB no longer branches to BB. So it can be rewritten as:

Formula 3:
  BlockFreq(RealDest)
= BlockFreq(PredBB) * newP(PredBB->RealDest) + other2 * P(BB->RealDest) + other1

Thus, to match Formula 1 and Formula 3, we need to make newP(PredBB->RealDest) in the new graph equal to P(PredBB->BB) * P(BB->RealDest) in the original graph.

One way is to scale every branch weight of an edge from PredBB that does not go to BB by a factor x.
Supposing:
(1) A means the sum of all branch weights from PredBB to BB, which is PTITotalTakenWeight in the source code
(2) B means the sum of the branch weights from PredBB that do not go to BB, which is PTITotalWeight - PTITotalTakenWeight in the source code
(3) S means the sum of all branch weights from PredBB, which is PTITotalWeight in the source code and equal to A+B
(4) C means the branch weight from BB to RealDest, which is BITakenWeight in the source code
(5) D means the branch weight from BB to the other successor, which is BINotTakenWeight in the source code
So:

   newP(PredBB->RealDest) = P(PredBB->BB) * P(BB->RealDest)
=> A/(A+x*B) = A/(A+B) * C/(C+D)
=> x = 1 + (S/B) * (D/C)
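As a quick sanity check of the derivation, here is a small standalone snippet (not part of the patch) using made-up weights A = 1, B = 1, C = 1, D = 3, so S = 2 and x = 1 + (2/1) * (3/1) = 7; both sides of the matching condition then equal 1/8:

#include <cassert>

int main() {
  // Made-up example weights, for illustration only.
  double A = 1, B = 1, C = 1, D = 3, S = A + B;
  double X = 1 + (S / B) * (D / C);            // scaling factor, here 7
  double NewP = A / (A + X * B);               // newP(PredBB->RealDest)
  double OldP = (A / (A + B)) * (C / (C + D)); // P(PredBB->BB) * P(BB->RealDest)
  assert(X == 7 && NewP == 0.125 && OldP == 0.125);
  return 0;
}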

Then, for each branch weight of an edge from PredBB that does not go to BB, we apply this transformation:

   NewWeight = OldWeight * x
=> NewWeight = OldWeight * (1 + (S/B) * (D/C))
=> NewWeight = OldWeight + (S*OldWeight/B) * (D/C)

which helps avoid overflow (see the source code for details).
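Below is a minimal sketch of this update rule as a hypothetical standalone helper (not the patch's actual code), assuming the relevant weights have already been extracted into 64-bit integers and that a flag marks which PredBB edges go to BB; re-fitting the results into uint32_t is omitted:

#include <cstdint>
#include <vector>

// Apply NewWeight = OldWeight + (S * OldWeight / B) * (D / C) to every
// weight of a PredBB edge that does not go to BB, with S, B, C, D as defined
// in the list above. Illustrative only; a real implementation must also
// guard against overflow and scale the results back into uint32_t.
static void scalePredWeights(std::vector<uint64_t> &PredWeights,
                             const std::vector<bool> &EdgeGoesToBB,
                             uint64_t S, uint64_t B, uint64_t C, uint64_t D) {
  if (B == 0 || C == 0)
    return; // nothing to rescale, or RealDest is never taken from BB
  for (size_t Idx = 0; Idx < PredWeights.size(); ++Idx) {
    if (EdgeGoesToBB[Idx])
      continue; // weights of PredBB->BB edges stay unchanged
    uint64_t Old = PredWeights[Idx];
    PredWeights[Idx] = Old + (S * Old / B) * D / C;
  }
}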

Also, since this change may attach branch weights to BranchInst/SwitchInst instructions that did not originally have them, we skip the cases where we would only be assigning equal branch weights (the default behavior) anyway.

Diff Detail

Event Timeline

LukeZhuang created this revision. Aug 5 2022, 1:19 PM
Herald added a project: Restricted Project. Aug 5 2022, 1:19 PM
Herald added a subscriber: hiraditya.
LukeZhuang requested review of this revision. Aug 5 2022, 1:19 PM
lattner resigned from this revision. Aug 5 2022, 3:56 PM

Hello, does anyone have comments on this change, or any suggestions for a better way to fix it? Thanks!

nikic added a comment. Aug 9 2022, 9:16 AM

This is a jump threading optimization -- is it possible to reuse the branch weight adjustment code from the JumpThreading pass? The updateBlockFreqAndEdgeWeight() method looks relevant.

This is a jump threading optimization -- is it possible to reuse the branch weight adjustment code from the JumpThreading pass? The updateBlockFreqAndEdgeWeight() method looks relevant.

Hi nikic, thanks for the comments! Actually, I have two questions about this:

  1. I just looked at the code; the JumpThreading pass seems to do the same thing as FoldCondBranchOnValueKnownInPredecessor in SimplifyCFG. May I ask why we need both of them? Or is this just a misunderstanding on my part?
  2. I was also thinking of using BlockFrequencyInfo, but I found that it is not used or maintained in the SimplifyCFG pass. Does that mean we need to add it?

Hello, sorry for bothering you again. Could anyone help me take a look at this change or give suggestions on my previous questions?
Fixing this issue is important to us. Thank you very much!

RKSimon added inline comments. Aug 19 2022, 6:59 AM
llvm/lib/Transforms/Utils/SimplifyCFG.cpp
1099

'Keep halving' suggests there should be a while loop?

LukeZhuang added inline comments. Aug 19 2022, 7:06 AM
llvm/lib/Transforms/Utils/SimplifyCFG.cpp
1099

Oh, this function and its comment are copied from FitWeights above. Yeah, I agree that it does not quite match the comment. I would prefer to change the comment to something like "Fit weights so that ..."

Hi, thanks for working on this.

I'm probably not the most qualified reviewer in this area, but I wanted to supply some feedback. In particular, I've focused on places where I felt like this patch was hard to follow, and have tried to make suggestions that would have aided my understanding.

I also agree w/ @nikic that this looks related to JumpThreadingPass::updateBlockFreqAndEdgeWeight. Do you have thoughts on why that is/isn't relevant?

As a side note, you may get better responses if the # of reviewers is smaller. People tend to assume someone else will review a change if there are lots of people in the list. Moreso if anyone else responds. That was certainly the case w/ me. :)

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
879

I don't think you need this helper function.

the setBranchWeights(SwitchInst*,...) version should be able to handle these cases if you just modify the signature to take Instruction* instead and add the assert below.

The reason is that CreateBranchWeights() only offers a convenient API for branch/select that takes two weights, to avoid writing boilerplate at call sites. Internally it just calls the version of CreateBranchWeights() that takes an ArrayRef.

In this case, I think it's better to follow the same approach and avoid introducing a redundant function implementation.

1100

I’m fairly certain we don’t want to scale the weights down so that the sum of all weights fits in a uint32.

For one there are all kinds of things that assume they can be larger. Off the top of my head, Optimization Remarks are often filtered by hotness (i.e. branch weight). But IIRC there are also transforms that only trigger when the weights exceed some threshold. Those may be better served by using a probability, but my recollection is that they do not. Though on this point someone w/ more comprehensive knowledge in this area should probably weigh in. Still, I'm skeptical that we want to do this.

I'm a bit surprised that scaling weights down like this does not affect any tests. That may be highlighting some untested paths more than that this change in behavior is doing the right thing.

2998

What is PTI? Is this a common term from somewhere? If it's the predecessor, then Pred or PredInst seems to communicate more.

3000–3001

Can you rephrase this? I'd like this to be clear to all readers, and it's currently a bit hard to parse.

3003–3005

I'd suggest avoiding variable names from the implementation here where you're documenting your algorithm and approach.

Can you also break each variable def/description onto its own line to make reading a bit easier?

3007

This function seems to go to a lot of trouble to avoid overflowing uint32_t, which is good, but it seems like things would be significantly simpler if you did the calculations with uint64_t and then scaled the weights the same way we do elsewhere. If that is common functionality, then it can be moved to elsewhere, like IR/ProfDataUtils.

Are there other reasons, concerns, or context here that I'm missing?
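For reference, a minimal sketch of the "compute in uint64_t, then scale down at the end" approach suggested here; the helper name and container are hypothetical, and it is only similar in spirit to SimplifyCFG's existing FitWeights:

#include <algorithm>
#include <cstdint>
#include <vector>

// Do the arithmetic in uint64_t and, only at the end, divide every weight by
// a common factor so the largest one fits into uint32_t. Dividing all weights
// by the same factor preserves the probabilities they encode.
static void fitWeightsToU32(std::vector<uint64_t> &Weights) {
  if (Weights.empty())
    return;
  uint64_t Max = *std::max_element(Weights.begin(), Weights.end());
  if (Max <= UINT32_MAX)
    return;
  uint64_t Scale = Max / UINT32_MAX + 1;
  for (uint64_t &W : Weights)
    W = std::max<uint64_t>(W / Scale, 1); // keep weights non-zero
}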

3012

I don't think it's correct to assert this, is it? Weights are allowed to be 32-bits, and summing them could overflow uint32_t, and is an expected possibility (see ProfDataUtils.cpp). So why is this case different?

3017

Can you rename this Idx, or something to more clearly indicate this is an index? It's extremely common in our codebase to use I for instructions, and in this case I'm finding myself reading the code incorrectly.

This compounded below when you say:

// Branch "I" is also one of the PTI not_taken branch and that means
3022

This is significantly easier to follow than the code below. Why not use these names?

3027

If you need an intermediate variable, please call out what it is in your equations.

We can all figure it out, but it helps to be either obvious from the naming, or to note in the comment that you're calculating (totW*Weights[I]/totNTW).

3045

OK, so PTI is the Predecessor's Terminator Inst... I still think we need to name some of this differently. PTI is probably fine for referring to the terminator itself when you care about it being a terminator. But in most cases here you're really talking about incoming branches and their weights. PTIWeights doesn't communicate that IMO.

3048–3062

can't you simplify this significantly by treating branch and switch uniformly? We have APIs in ProfDataUtils just for that...

You should be able to avoid most of the dyn_casts and the different conditions almost entirely, right?

https://github.com/llvm/llvm-project/blob/a7441289e2eb8ccf71c1a60b71d898069e82b22b/llvm/lib/IR/ProfDataUtils.cpp#L111

Assigning default weights can also be made uniform.
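A short sketch of the uniform handling suggested here, assuming the ProfDataUtils overload of extractBranchWeights that takes an Instruction; the helper name is hypothetical:

#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instruction.h"
#include "llvm/IR/ProfDataUtils.h"

using namespace llvm;

// Works for both BranchInst and SwitchInst terminators: if !prof metadata is
// present, use it; otherwise fall back to equal (default) weights, one per
// successor, so the caller never needs separate dyn_casts.
static SmallVector<uint32_t, 4> getOrSynthesizeWeights(const Instruction &TI) {
  SmallVector<uint32_t, 4> Weights;
  if (!extractBranchWeights(TI, Weights))
    Weights.assign(TI.getNumSuccessors(), 1);
  return Weights;
}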

3052
3161–3164
3171–3172

Are we inverting the branch condition? why are we swapping? I think I've missed something here...

llvm/test/Transforms/SimplifyCFG/fold-cond-bi-fix-weight.ll
12–13

I don't know if these comments (here and elsewhere) make sense outside of this patch, since 'Before' and 'After' are only relevant in the context of your change...

I think if you want to more thoroughly explain the purpose of the test, you'll need to provide context. A summary of the incorrect behavior and what a regression looks like is probably enough. You can also ref this patch, and anyone coming along later can easily understand why this test exists, and what it's checking.

WDYT?

Hi Paul, thank you very much for the suggestions! I'm working on improving the code according to your comments.
For your question about JumpThreading, since I'm not an expert on this, here is my understanding:
(1) I agree that using block frequency (like the function nikic suggested) is the best way to solve this issue. I was trying to use BlockFrequencyInfo at the beginning, but I found that this information is not used or maintained anywhere else in SimplifyCFG. May I ask what we usually do in this situation? Do we need to add it to SimplifyCFG?
(2) JumpThreading seems to be doing the same thing as FoldCondBranchOnValueKnownInPredecessor in SimplifyCFG; may I ask whether you (or someone else) know why we need both of them? Also, JumpThreading does not seem to run at -O1. I guess if we could rely on JumpThreading, this issue would not exist in the first place.
Anyway, thank you for the comments!

(1) I agree that using block frequency (like the function nikic suggested) is the best way to solve this issue. I was trying to use BlockFrequencyInfo at the beginning, but I found that this information is not used or maintained anywhere else in SimplifyCFG. May I ask what we usually do in this situation? Do we need to add it to SimplifyCFG?

I'm not completely sure I follow you here. Are you asking if it's OK to use BlockFrequencyInfo + BranchProbabilityInfo in SimplifyCFG? I don't see why not if that is the correct thing to do and avoids complexity. Those are analysis passes, so should be safe to take a dependency on. SimplifyCFG is going to invalidate them though, so it's going to introduce overhead. IMO correctness is more important though, so I wouldn't worry about running a (usually quick) set of analysis passes.

The other thing to consider is if JumpThreading should have that functionality extracted to some common place so SimplifyCFG can share the implementation. That would be my preference, but there may be reasons to avoid that.

(2) JumpThreading seems to be doing the same thing as FoldCondBranchOnValueKnownInPredecessor in SimplifyCFG; may I ask whether you (or someone else) know why we need both of them? Also, JumpThreading does not seem to run at -O1. I guess if we could rely on JumpThreading, this issue would not exist in the first place.

I'm not an authority on this, so please take my comments here with a grain of salt.

Jump threading is a pretty well known type of optimization. The change in the CFG you're patching up is an example of jump threading: find a block that always jumps to another and bypass the intermediate block. It's simple/safe enough that the maintainers of SimplifyCFG felt that it belonged here. This can be a pretty profitable optimization, so I'm not surprised that's the case.

It's my understanding that the JumpThreadingPass does more sophisticated things than SimplifyCFG, and you may not always want to do that. In some cases, those optimizations can make loops irreducible, so LLVM has historically been conservative w/ its jump threading implementation. I also believe that's why it only runs at higher optimization levels, but someone with more familiarity with this part of LLVM can probably provide a more complete answer than I can.

Anyway, thank you for the comments!

Happy to help, but, again, I'm not sure I'm the most qualified reviewer for this...

BTW, did you look into the rest of SimplifyCFG to see if there are more systemic issues w/ how branch weights are propagated? If there are, a more general solution/refactor may be warranted, and I would advise you to bring it up on discourse to get more interest/feedback.

Hi @paulkirth, sorry for the late reply. If I understand it correctly, BlockFrequencyInfo can only be used with profile information. I found the original patches for updateBlockFreqAndEdgeWeight (https://reviews.llvm.org/D10979 and https://reviews.llvm.org/D15519); that is why the author added if (HasProfileData) and only applies the fix when profile data is available.
However, that code is seven years old, and __builtin_expect_with_probability was added in 2020. This probability issue (the example at the beginning of my description) can also happen without profile data, and the user will find that the probability of executing bar() is not what they expected. So it is not sufficient for us to only do this with profile data. This is our concern about using BlockFrequencyInfo.
And again, thank you very much for the kind and useful comments! I think all of them are reasonable and I am working on them.

LukeZhuang added inline comments. Aug 24 2022, 8:35 AM
llvm/lib/Transforms/Utils/SimplifyCFG.cpp
1100

Hi, I have another question about this. In D10979, which I mentioned previously, the author wrote something similar, normalizeEdgeWeights, which also scales the sum to be within UINT32_MAX. But later, in D15519, he changed everything from weights to probabilities. I guess he changed it for the same reason you mentioned: with probabilities we do not need to worry about integer overflow.
However, currently no other place in SimplifyCFG fixes up the CFG using probabilities; they all do integer weight calculations. The most typical one is performBranchToCommonDestFolding. So I'm concerned about consistency; is it okay to do the probability calculation here?

Hi @paulkirth, sorry for the late reply. If I understand it correctly, BlockFrequencyInfo can only be used with profile information. I found the original patches for updateBlockFreqAndEdgeWeight (https://reviews.llvm.org/D10979 and https://reviews.llvm.org/D15519); that is why the author added if (HasProfileData) and only applies the fix when profile data is available.

I'm not sure that was the rationale, looking at the original patch. I assume it was just because branch weights were assumed not to be present if profiling wasn't enabled. The bigger issue is that I think the BPI and BFI passes assume loops are simplified, which I think happens in this pass. https://github.com/llvm/llvm-project/blob/f24c299e7d733c0afb9a17655201f37a43d84ce5/llvm/lib/Analysis/BlockFrequencyInfo.cpp#L9

However, that code is seven years old, and __builtin_expect_with_probability was added in 2020. This probability issue (the example at the beginning of my description) can also happen without profile data, and the user will find that the probability of executing bar() is not what they expected. So it is not sufficient for us to only do this with profile data. This is our concern about using BlockFrequencyInfo.
And again, thank you very much for the kind and useful comments! I think all of them are reasonable and I am working on them.

branch weights are how we derive probability and frequency information. __builtin_expect_with_probability still adds weights, it just adds them using the defined probability distribution. https://github.com/llvm/llvm-project/blob/f24c299e7d733c0afb9a17655201f37a43d84ce5/llvm/lib/Transforms/Scalar/LowerExpectIntrinsic.cpp#L63

As mentioned above, the bigger issue is that the BPI and BFI passes may be incompatible w/ simplifyCFG. You may not need to run the passes though. I would investigate whether you can construct/instantiate the BranchProbability/BlockFrequency directly given the input blocks and still derive the result you're after. Do note though, I haven't looked at the BPI/BFI passes, their algorithms, or the classes deeply, so this may not help. It's just that for correctly balancing/scaling weights I doubt you would need to run the full passes. That's just a suspicion though, and I may be steering you the wrong way. It's just what I would investigate.

paulkirth added inline comments. Aug 24 2022, 2:14 PM
llvm/lib/Transforms/Utils/SimplifyCFG.cpp
1100

Looks like I may have steered you in the wrong direction. The BPI pass is going to scale these anyway; see https://github.com/llvm/llvm-project/blob/27d666b9adc12a13f67ccde96d5b4d181f56ffc5/llvm/lib/Analysis/BranchProbabilityInfo.cpp#L420

We should just do what they do, and since this is now common functionality, we should extract that logic and make it reusable. ProfDataUtils is a good candidate for where that code should live, since this is about branch weights, but if there is somewhere else that makes sense, that is also fine. I'd also suggest making that a separate patch, since it's kind of a code cleanup that is orthogonal to this change.

Hi @paulkirth, I summarized all the issues about BFI and brought them up on Discourse: https://discourse.llvm.org/t/is-that-blockfrequencyinfo-can-only-be-used-with-profile-data/64855. Based on David's response, I think I can start trying to add BFI to SimplifyCFG to see if it is feasible.

David is certainly an expert on this part of the compiler, so I expect you to be successful using his advice.

I am also not entirely sure you need to go to those lengths though, as I mentioned in my earlier comment. But if you find a straightforward solution, I'll be happy to offer any help I can.

I did not look at the patch in detail. At a high level, it seems that the pass itself does not use profile data to make decisions, so BPI/BFI is really not needed. Instead, directly updating the profile metadata as the transformation happens is reasonable (unless BFI provides updating APIs for cases like this, which is not the case).

nikic resigned from this revision. Aug 29 2022, 1:38 AM

Regarding the duplication between JumpThreading and SimplifyCFG, this basically comes down to phase ordering. SimplifyCFG is run many times throughout the pipeline, while passes like JumpThreading run only once. For that reason, SimplifyCFG often implements some basic version of optimizations that have a more powerful implementation in a dedicated pass.

Regarding BFI/BPI, I don't think we want to use those in SimplifyCFG. I don't think we can reasonably preserve BFI through all SimplifyCFG transforms -- and I believe it also depends on DT and LI, which we don't want to have as SimplifyCFG dependencies at this time.

When I mentioned updateBlockFreqAndEdgeWeight() that was more as a pointer to code that should be doing something similar -- if it uses BFI, then we indeed shouldn't reuse it directly (but maybe there is some useful logic that can be shared -- I don't know). It is a bit odd that your code seems to be substantially more complicated than what JumpThreading does.

Anyway, I'm not really familiar with branch weight metadata, so these were really just some drive-by notes, someone who is familiar should review this.

Thank you all for the help!
@paulkirth, your earlier comments really helped, and I'm improving the code now.
@davidxl, I also found it difficult to add BFI to SimplifyCFG; could you please help me check the high-level approach of this patch and whether this updating method makes sense?
@nikic, thank you anyway for the suggestions. I think it looks harder than in JumpThreading because our final goal is to keep the block frequencies the same after the CFG change. JumpThreading knows BFI and needs to update the branches (as if it already knows the result and wants to find the reason), but SimplifyCFG cannot use BFI, so it has to work in a reverse manner. That's just my guess.

LukeZhuang removed a subscriber: nikic. Aug 29 2022, 3:18 PM
davidxl added inline comments. Sep 6 2022, 9:05 AM
llvm/test/Transforms/SimplifyCFG/fold-cond-bi-fix-weight.ll
12–13

The comment could be enhanced so that readers can follow it. Can you add ASCII art of the CFG after SimplifyCFG?

90

Similarly, add a CFG description post-transformation.

132

Can you split out the overflow fix into a separate patch?

175

Add the post-SimplifyCFG CFG.

LukeZhuang marked 21 inline comments as done.
LukeZhuang removed a reviewer: nikic.

Sorry for the late reply. @paulkirth @davidxl, I just updated the patch according to your comments.

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
2998

I think I used this name because I found that some other functions also use it, for example https://github.com/llvm/llvm-project/blob/49e7ef2c09facd722a29a5ad96a7f8f16e362b28/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L1390, as well as #L1096. Yeah, I agree that it might be confusing if you read the definition before its usage. I added a description to it. Thanks for pointing that out.

3007

Currently I already do this in uint64_t. I need to add the overflow handling because, in the original formula, there is a product of three uint32_t values (totW * Weights[i] * BNTW), which can overflow even in uint64_t.
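A small standalone illustration (not part of the patch) of why the third factor is the problem:

#include <cstdint>
#include <cstdio>

int main() {
  // (2^32 - 1)^2 still fits in uint64_t, but multiplying by a third uint32_t
  // factor can exceed UINT64_MAX (~1.8e19 vs. up to ~7.9e28), so one division
  // (or scaling the weights down first) has to happen before the last multiply.
  uint64_t W = UINT32_MAX;
  printf("two factors: %llu\n", (unsigned long long)(W * W));
  printf("UINT64_MAX:  %llu\n", (unsigned long long)UINT64_MAX);
  return 0;
}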

3012
3171–3172

Because, in the previous line with "BIHasWeights", I assumed that the "true successor" is BB->RealDest and the "false successor" is the other one when calling extractBranchWeights. This line swaps them if the "false successor" is actually BB->RealDest.
Sorry for the confusion; I also added comments to avoid misunderstanding.

paulkirth added inline comments. Sep 22 2022, 10:09 AM
llvm/lib/Transforms/Utils/SimplifyCFG.cpp
1100

This seems to be common functionality that both BranchProbabilityInfo and SimplifyCFG need. While IMO https://github.com/llvm/llvm-project/blob/27d666b9adc12a13f67ccde96d5b4d181f56ffc5/llvm/lib/Analysis/BranchProbabilityInfo.cpp#L420 still seems like the right approach, whichever implementation we go with should be factored out into a common utility. That way all the parts of LLVM that need this functionality can benefit.

2998–3000

Is this an accurate way to rephrase this? It's how I read it, but it seems inconsistent w/ the comments + formula below.

3013

This seems to conflict w/ your description above. "... to be its original value times the probability of..." Implies to me that X = BTW/TotW.

Either the description above is incomplete, or this isn't the right formula to use. Can we make these consistent?

3030–3032

Do you think this edit would make it easier to follow? Is it missing any important information?

3042

nit: I is now Idx, here and elsewhere in this description

3170–3171

nit: you don't need this variable at all, since you only use it in the conditional below.

if(extractBranchWeights(...)) is probably a bit more idiomatic.

davidxl added inline comments. Oct 5 2022, 9:08 AM
llvm/test/Transforms/SimplifyCFG/fold-cond-bi-fix-weight-overflow.ll
10 ↗(On Diff #460769)

In this case, why does the prof metadata for the switch need to change in the first place? Can the original metadata be retained?

Regarding overflow, should the branch weights be scaled down first to avoid the problem in the first place?

llvm/test/Transforms/SimplifyCFG/fold-cond-bi-fix-weight.ll
59

By looking at the example only, it looks like the right way is to compute the branch weights for the entry block using backward propagation.

Say entry->lor.end has weight x, and entry->lor.rhs has weight y. Assume entry (and lor.end) has weight 100. Then the following system of equations holds (assuming no path sensitivity, so that the branch weights of lor.rhs remain the same after the transformation):

x + y = 100
x + y/10 = 10

Subtracting the second equation from the first gives (9/10) * y = 90, which yields y = 100 and x = 0.

Please note that, due to the possibility of path-sensitive behavior, there is no reason to believe that the probability of lor.rhs without this patch is any worse than the one with the patch.

And @paulkirth, thank you very much as well for the comments; I'm working on improving the patch.

llvm/test/Transforms/SimplifyCFG/fold-cond-bi-fix-weight-overflow.ll
10 ↗(On Diff #460769)

I created this test not to test the feature itself but to catch a potential overflow risk in my method if I do not handle it appropriately.
I already scale the weights down in the code, but even so, I still need to compute the product of three uint32_t values, and I proved that the product of two of them will not exceed UINT_MAX. This test is intended to cover that.

llvm/test/Transforms/SimplifyCFG/fold-cond-bi-fix-weight.ll
59

Hi David, thanks for the comments!
I have a question about it. Backward propagation requires us to know the block frequency info (or, more specifically, the whole CFG); is that correct? Otherwise I think we cannot know that lor.end has the same weight as entry, or that if.then has 10% of entry's weight.
So I think that goes back to our previous discussion. The first issue is that the transformation in SimplifyCFG knows neither BlockFrequencyInfo nor the whole CFG. The second issue is that this branch weight maintenance can also be needed with static branch prediction alone (using __builtin_expect in my example), in which case we do not have frequency info from dynamic analysis such as PGO.
Please correct me if I have understood it incorrectly.

davidxl added inline comments. Oct 7 2022, 9:07 AM
llvm/test/Transforms/SimplifyCFG/fold-cond-bi-fix-weight.ll
59

The branch weights are relative numbers (used to compute branch probabilities); comparing weights across nodes is meaningless. BFI uses BPI information to compute block frequencies. Block frequencies are relative, but at the function level -- comparing block frequencies across functions is not meaningful. With a function entry count (real profile) available, block frequency can be converted to a block count, which is globally (whole-program) valid and comparable.

In other words, within a single-entry/single-exit region, you should be able to propagate branch weights (to be more precise, propagate taken/not-taken ratios, i.e., probabilities, not the raw weight values). You can assume any frequency for the entry node for that matter.

The key problem with backward propagation is that it assumes the probability of lor.lhs does not change after the splitting. This assumption may not hold in reality. It is equally reasonable to assume a probability at the entry node and do forward propagation.