This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/Transforms/Utils/
-
Transforms/
-
Utils/
22/30
SimplifyCFG.cpp
-
test/Transforms/SimplifyCFG/
-
Transforms/
-
SimplifyCFG/
5/8
fold-cond-bi-fix-weight.ll

Differential D131287

Fix branch weight in FoldCondBranchOnValueKnownInPredecessor pass in SimplifyCFG
Needs ReviewPublic

Authored by LukeZhuang on Aug 5 2022, 1:19 PM.

Download Raw Diff

Details

Reviewers

RKSimon
davidxl
paulkirth

Summary

The patch is intended to fix this test case:

extern void bar();
void foo(char a, char b, char c) {
  char aa = 2 * a, bb = 2 * b, cc = 2 * c;
  if (__builtin_expect_with_probability(((aa == 2) || (bb == 2) || (cc == 0)), 1, 0)) {
    bar();
  }
}

The user defines the probability of calling function bar as 0%.

Before the second SimplifyCFG pass, the LLVM IR is as follows:

define dso_local void @foo(i8 noundef signext %0, i8 noundef signext %1, i8 noundef signext %2) local_unnamed_addr #0 {
  %4 = and i8 %0, 127
  %5 = icmp eq i8 %4, 1
  %6 = and i8 %1, 127
  %7 = icmp eq i8 %6, 1
  %8 = or i1 %5, %7
  br i1 %8, label %12, label %9

9:                                                ; preds = %3
  %10 = and i8 %2, 127
  %11 = icmp eq i8 %10, 0
  br label %12

12:                                               ; preds = %9, %3
  %13 = phi i1 [ true, %3 ], [ %11, %9 ]
  br i1 %13, label %14, label %15, !prof !5

14:                                               ; preds = %12
  call void (...) @bar() #2
  br label %15

15:                                               ; preds = %14, %12
  ret void
}

!5 = !{!"branch_weights", i32 1, i32 2147483647}

Here, the probability of calling function bar() is still 0%, since the probability of executing block12 is 100%, and the probability from block12 to block14 is 0%.

Then after the FoldCondBranchOnValueKnownInPredecessor pass, the entry block directly points to block14, and the IR becomes as follows:

define dso_local void @foo(i8 noundef signext %0, i8 noundef signext %1, i8 noundef signext %2) local_unnamed_addr #0 {
  %4 = and i8 %0, 127
  %5 = icmp eq i8 %4, 1
  %6 = and i8 %1, 127
  %7 = icmp eq i8 %6, 1
  %8 = or i1 %5, %7
  br i1 %8, label %12, label %9

9:                                                ; preds = %3
  %10 = and i8 %2, 127
  %11 = icmp eq i8 %10, 0
  br i1 %11, label %12, label %13, !prof !5

12:                                               ; preds = %3, %9
  call void (...) @bar() #2
  br label %13

13:                                               ; preds = %12, %9
  ret void
}

!5 = !{!"branch_weights", i32 1, i32 2147483647}

Both entry block and block9 have BranchInst and they both point to block12, so after the performBranchToCommonDestFolding pass, they are merged. And the branch weight is recalculated.

So the final IR is as follows:

define dso_local void @foo(i8 noundef signext %0, i8 noundef signext %1, i8 noundef signext %2) local_unnamed_addr #0 {
  %4 = and i8 %0, 127
  %5 = icmp eq i8 %4, 1
  %6 = and i8 %1, 127
  %7 = icmp eq i8 %6, 1
  %8 = or i1 %5, %7
  %9 = and i8 %2, 127
  %10 = icmp eq i8 %9, 0
  %11 = or i1 %8, %10
  br i1 %11, label %12, label %13, !prof !5

12:                                               ; preds = %3
  call void (...) @bar() #2
  br label %13

13:                                               ; preds = %3, %12
  ret void
}

!5 = !{!"branch_weights", i32 -2147483647, i32 2147483647}

Now the probability of calling function bar() is over 50%, which is very different from what the user defined.

We believe the issue is caused by the branch weight not being updated in the FoldCondBranchOnValueKnownInPredecessor pass. The optimization that this pass does can be simplified to this case:

before:
  PredBB  other2
       \  /
        \/
other1  BB
     \  /\
      \/  \
 RealDest  other3

The block execution frequency of RealDest here can be represented as:

Formula 1:
  BlockFreq(RealDest)
= BlockFreq(BB) * P(BB->RealDest) + other1
= (BlockFreq(PredBB) * P(PredBB->BB) + other2) * P(BB->RealDest) + other1
= BlockFreq(PredBB) * P(PredBB->BB) * P(BB->RealDest) + other2 * P(BB->RealDest) + other1

in which P(A->B) represents the probability from blockA to blockB.

Then after the pass, the graph is updated to:

after:
  PredBB   other2
       |     /
       |    /
other1 |   BB
     \ |   /\
      \|  /  \
   RealDest    other3

PredBB is redirected to RealDest if the BranchInst in BB is always pointing to RealDest when the condition is from PredBB.
The new block execution frequency of RealDest is:

Formula 2:
  BlockFreq(RealDest)
= BlockFreq(PredBB) * newP(PredBB->RealDest) + BlockFreq(BB) * P(BB->RealDest) + other1

in which BlockFreq(BB) is equal to other2 since PredBB no longer points to BB. So it can be represented as:

Formula 3:
  BlockFreq(RealDest)
= BlockFreq(PredBB) * newP(PredBB->RealDest) + other2 * P(BB->RealDest) + other1

Thus, to match Formula 1 and Formula 3, we need to make newP(PredBB->RealDest) in the new graph equal to P(PredBB->BB) * P(BB->RealDest) in the original graph.

One way is to scale all the branch weights from PredBB but not to BB by multiplying a factor x.
Supposing:
(1) A means the sum of all branch weights from PredBB to BB, which is PTITotalTakenWeight in the source code
(2) B means sum of those from PredBB but not to BB, which is PTITotalWeight-PTITotalTakenWeight in the source code
(3) S means the sum of all branch weights from PredBB, which is PTITotalWeight in the source code and equal to A+B
(4) C means the branch weight from BB to RealDest, which is BITakenWeight in the source code
(5) D means the branch weight from BB to the other successor, which is BINotTakenWeight in the source code
So:

   newP(PredBB->RealDest) = P(PredBB->BB) * P(BB->RealDest)
=> A/(A+x*B) = A/(A+B) * C/(C+D)
=> x = 1 + (S/B) * (D/C)

Then for each branch weight from PredBB but not to BB, we do this transformation:

   NewWeight = OldWeight * x
=> NewWeight = OldWeight * (1 + (S/B) * (D/C))
=> NewWeight = OldWeight + (S*OldWeight/B) * (D/C)

which will prevent overflow (see source code for detailed information)

Also, since this change may give branch weights to some BranchInst/SwitchInst that do not have them originally, we skip cases when just giving equal branch weight (default behavior) to them.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

LukeZhuang created this revision.Aug 5 2022, 1:19 PM

Herald added a project: Restricted Project. · View Herald TranscriptAug 5 2022, 1:19 PM

Herald added a subscriber: hiraditya. · View Herald Transcript

LukeZhuang requested review of this revision.Aug 5 2022, 1:19 PM

Herald added a subscriber: llvm-commits. · View Herald TranscriptAug 5 2022, 1:19 PM

Harbormaster completed remote builds in B179577: Diff 450365.Aug 5 2022, 1:30 PM

lattner resigned from this revision.Aug 5 2022, 3:56 PM

Hello, may I ask do anyone have comments on this change? or any suggestions of fixing this in better ways? Thanks!

This is a jump threading optimization -- is it possible to reuse the branch weight adjustment code from the JumpThreading pass? The updateBlockFreqAndEdgeWeight() method looks relevant.

In D131287#3710144, @nikic wrote:

This is a jump threading optimization -- is it possible to reuse the branch weight adjustment code from the JumpThreading pass? The updateBlockFreqAndEdgeWeight() method looks relevant.

Hi Nikic, thanks for the comments! Actually I have two questions about this:

I just looked at the code, JumpThreading pass seems did the same thing as FoldCondBranchOnValueKnownInPredecessor in SimplifyCFG. May I ask why we need both of them originally? Or is this just my misunderstanding
I previously also thinking of using BlockFrequencyInfo but I found it's not used and maintained in SimplifyCFG pass. Is that means we need to add it?

Hello : )

Hello guys, sorry for bothering again. May I ask could anyone help me look at this change or give suggestions to my previous question?
Because fixing this issue is important to us. Thank you very much!

Gentle ping :) @nikic

RKSimon added inline comments.Aug 19 2022, 6:59 AM

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
1099	'Keep halving' suggests there should be a while loop?

LukeZhuang added inline comments.Aug 19 2022, 7:06 AM

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
1099	Oh, this function and comment is copied from the FitWeights above. Yeah I agree that it somewhat does not match with comments. I prefer to change comment to like "Fit weight so that ..."

Hi, thanks for working on this.

I'm probably not the most qualified reviewer in this area, but I wanted to supply some feedback. In particular, I've focused on places where I felt like this patch was hard to follow, and have tried to make suggestions that would have aided my understanding.

I also agree w/ @nikic that this looks related to JumpThreadingPass::updateBlockFreqAndEdgeWeight. Do you have thoughts on why that is/isn't relevant?

As a side note, you may get better responses if the # of reviewers is smaller. People tend to assume someone else will review a change if there are lots of people in the list. Moreso if anyone else responds. That was certainly the case w/ me. :)

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
879	I don't think you need this helper function. the `setBranchWeights(SwitchInst,...)` version should be able to handle these cases if you just modify the signature to take `Instruction` instead and add the assert below. The reason being is that `CreateBranchWeights()` only offers a convenient api for branch/select that takes two weights to avoid writing boilerplate at callstites. Internally it just calls the version of `CreateBranchWeights()` that takes an `ArrayRef`. In this case, I think in this case it’s better to follow the same approach and avoid introducing a redundant function implementation.
1100	I’m fairly certain we don’t want to scale the weights down so that the sum of all weights fits in a uint32. For one there are all kinds of things that assume they can be larger. Off the top of my head, Optimization Remarks are often filtered by hotness (i.e. branch weight). But IIRC there are also transforms that only trigger when the weights exceed some threshold. Those may be better served by using a probability, but my recollection is that they do not. Though on this point someone w/ more comprehensive knowledge in this area should probably weigh in. Still, I'm skeptical that we want to do this. I'm a bit surprised that scaling weights down like this does not affect any tests. That may be highlighting some untested paths more than that this change in behavior is doing the right thing.
2998	What is PTI? Is this a common term from somewhere? If it's the predecessor, then `Pred` or `PredInst` seems to communicate more.
3000–3001	Can you rephrase this? I'd like this to be clear to all readers, and it's currently a bit hard to parse.
3003–3005	I'd suggest avoiding variable names from the implementation here where you're documenting your algorithm and approach. Can you also break each variable def/description onto its own line to make reading a bit easier?
3007	This function seems to go to a lot of trouble to avoid overflowing `uint32_t`, which is good, but it seems like things would be significantly simpler if you did the calculations with `uint64_t` and then scaled the weights the same way we do elsewhere. If that is common functionality, then it can be moved to elsewhere, like `IR/ProfDataUtils`. Are there other reasons other reasons, concerns, or context here that I'm missing?
3012	I don't think it's correct to assert this, is it? Weights are allowed to be 32-bits, and summing them could overflow `uint32_t`, and is an expected possibility (see `ProfDataUtils.cpp`). So why is this case different?
3017	Can you rename this `Idx`, or something to more clearly indicate this is an index? It's extremely common in our codebase to use `I` for instructions, and in this case I'm finding myself reading the code incorrectly. This compounded below when you say: // Branch "I" is also one of the PTI not_taken branch and that means
3022	This is significantly easier to follow than the code below. Why not use these names?
3027	If you need an intermediate variable, please call out what it is in your equations. We can all figure it out, but it helps to be either obvious from the naming, or to note in the comment that you're calculating `(totW*Weights[I]/totNTW)`.
3045	OK, so PTI is the Predecessor's Terminator Inst... I still think we need to name some of this differently. PTI is probably fine for reffing the terminator itself when you care about it being a terminator. But in most cases here you're really talking about incoming branches and their weights. PTIWeights doesn't communicate that IMO.
3048–3062	can't you simplify this significantly by treating branch and switch uniformly? We have APIs in ProfDataUtils just for that... You should be able to avoid most of the dyn_casts and the different conditions almost entirely, right? https://github.com/llvm/llvm-project/blob/a7441289e2eb8ccf71c1a60b71d898069e82b22b/llvm/lib/IR/ProfDataUtils.cpp#L111 Assigning default weights can also be made uniform.
3052	We have APIs for checking if an instruction has branch weight metadata. see https://github.com/llvm/llvm-project/blob/a7441289e2eb8ccf71c1a60b71d898069e82b22b/llvm/lib/IR/ProfDataUtils.cpp#L99
3161–3164
3161–3164
3171–3172	Are we inverting the branch condition? why are we swapping? I think I've missed something here...
llvm/test/Transforms/SimplifyCFG/fold-cond-bi-fix-weight.ll
12–13	I don't know if these comments(here and elsewhere) make sense outside of this patch, since 'Before' and 'After' are only relevant in the context of your change... I think if you want to more thoroughly explain the purpose of the test, you'll need to provide context. A summary of the incorrect behavior and what a regression looks like is probably enough. You can also ref this patch, and anyone coming along later can easily understand why this test exists, and what it's checking. WDYT?

Hi Paul, thank you very much for the suggestions! I'm working on improving the code according to your comments.
For your question about JumpThreading, since I'm not an expert on this, here is my understeading:
(1) I agree that using blockfrequency(like the function that nikic suggested) is the best way to solve this issue. I was trying to use BlockFrequencyInfo at the beginning but I found this information is not maintained and no otherwhere else use this in SimplifyCFG. May I ask what do we usually do for this situation? Is that adding this in SimplifyCFG?
(2) Jumpthreading seems is doing the same thing as FoldCondBranchOnValueKnownInPredecessor in SimplifyCFG, may I ask do you know(or someone else know) why we need both of them? And jumpthreading seems is not in -O1 optimization. I guess if we can use jumpthreading, this issue would not exist at the beginning.
Any way, thank you for the comments!

In D131287#3737801, @LukeZhuang wrote:

(1) I agree that using blockfrequency(like the function that nikic suggested) is the best way to solve this issue. I was trying to use BlockFrequencyInfo at the beginning but I found this information is not maintained and no otherwhere else use this in SimplifyCFG. May I ask what do we usually do for this situation? Is that adding this in SimplifyCFG?

I'm not completely sure I follow you here. Are you asking if it's OK to use BlockFrequencyInfo + BlockProbabilityInfo in SimplifyCFG? I don't see why not if that is the correct thing to do and avoids complexity. Those are analysis passes, so should be safe to take a dependency on. SimplifyCFG is going to invalidate them though, so it's going to introduce overhead. IMO correctness is more important though, so I wouldn't worry about running a (usually quick) set of analysis passes.

The other thing to consider is if JumpThreading should have that functionality extracted to some common place so SimplifyCFG can share the implementation. That would be my preference, but there may be reasons to avoid that.

(2) Jumpthreading seems is doing the same thing as FoldCondBranchOnValueKnownInPredecessor in SimplifyCFG, may I ask do you know(or someone else know) why we need both of them? And jumpthreading seems is not in -O1 optimization. I guess if we can use jumpthreading, this issue would not exist at the beginning.

I'm not an authority on this, so please take my comments here with a grain of salt.

Jump threading is a pretty well known type of optimization. The change in the CFG you're patching up is an example of jump threading: find a block that always jumps to another and bypass the intermediate block. It's simple/safe enough that the maintainers of SimplifyCFG felt that it belonged here. This can be a pretty profitable optimization, so I'm not surprised that's the case.

It's my understanding that the JumpThreadingPass does more sophisticated things than SimplifyCFG, and you may not always want to do that. In some cases, those optimizations can make loops irreducible, so LLVM has historically been conservative w/ it's jump threading implementation. I also believe that's why it only runs a higher optimization levels, but someone with more familiarity with this part of LLVM can probably provide a more complete answer than I can.

Any way, thank you for the comments!

Happy to help, but, again, I'm not sure I'm the most qualified reviewer for this...

BTW, did you look into the rest of SimplifyCFG to see if there are more systemic issues w/ how branch weights are propagated? If there are, a more general solution/refactor may be warranted, and I would advise you to bring it up on discourse to get more interest/feedback.

Hi @paulkirth, sorry for the late reply. If I understand it correctly, BlockFrequencyInfo can only be used with profile information. I have found the original author of updateBlockFreqAndEdgeWeight(https://reviews.llvm.org/D10979 and https://reviews.llvm.org/D15519), that's why the author added if (HasProfileData) and only do the fix when have profile data.
However, this code is 7 years ago and we added __builtin_expect_with_probability in 2020. This probability issue(that example at the beginning of my description) can also happen even without profile data, the user will find the probability of executing bar is not expected. So it’s not sufficient for us to only do this with profile data. This is our concern about using BlockFrequencyInfo.
And again thank you very much for the kind and useful comments! I think all of them are reasonable and is working on it.

LukeZhuang added a reviewer: congh.Aug 23 2022, 5:13 PM

LukeZhuang added inline comments.Aug 24 2022, 8:35 AM

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
1100	Hi, I have another question about this, in D10979 which I mentioned previously, the author wrote something similar: `normalizeEdgeWeights` which also scale the sum to be within UINT32_MAX. But later in D15519 he changed all of weight to probability. I guess he had changed to this because of probably the same reason as you mentioned, and with probability we do not need to care about integer overflow. However, currently no other place in SimplifyCFG is doing CFG fix using probability, all of them are integer weight calculation. The most typical one is `performBranchToCommonDestFolding`. So I'm concerned about consistency, may I ask is that okay to do probability calculation here?

In D131287#3744258, @LukeZhuang wrote:

Hi @paulkirth, sorry for the late reply. If I understand it correctly, BlockFrequencyInfo can only be used with profile information. I have found the original author of updateBlockFreqAndEdgeWeight(https://reviews.llvm.org/D10979 and https://reviews.llvm.org/D15519), that's why the author added if (HasProfileData) and only do the fix when have profile data.

I'm not sure that was the rationale looking at the original patch. I assume it was just because branch weights were assumed to not be present if profiling wasn't enabled. The bigger issue is that I think BPI and BFI passes assume loops are simplified, which I think happen in this pass. https://github.com/llvm/llvm-project/blob/f24c299e7d733c0afb9a17655201f37a43d84ce5/llvm/lib/Analysis/BlockFrequencyInfo.cpp#L9

However, this code is 7 years ago and we added __builtin_expect_with_probability in 2020. This probability issue(that example at the beginning of my description) can also happen even without profile data, the user will find the probability of executing bar is not expected. So it’s not sufficient for us to only do this with profile data. This is our concern about using BlockFrequencyInfo.
And again thank you very much for the kind and useful comments! I think all of them are reasonable and is working on it.

branch weights are how we derive probability and frequency information. __builtin_expect_with_probability still adds weights, it just adds them using the defined probability distribution. https://github.com/llvm/llvm-project/blob/f24c299e7d733c0afb9a17655201f37a43d84ce5/llvm/lib/Transforms/Scalar/LowerExpectIntrinsic.cpp#L63

As mentioned above, the bigger issue is that the BPI and BFI passes may be incompatible w/ simplifyCFG. You may not need to run the passes though. I would investigate whether you can construct/instantiate the BranchProbability/BlockFrequency directly given the input blocks and still derive the result you're after. Do note though, I haven't looked at the BPI/BFI passes, their algorithms, or the classes deeply, so this may not help. It's just that for correctly balancing/scaling weights I doubt you would need to run the full passes. That's just a suspicion though, and I may be steering you the wrong way. It's just what I would investigate.

LukeZhuang removed reviewers: lebedev.ri, lattner, danielcdh, manmanren, aeubanks, eraman, atrick, samparker, kazu, dmgreen, congh.Aug 24 2022, 1:07 PM

paulkirth added inline comments.Aug 24 2022, 2:14 PM

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
1100	Looks like I may have steered you in the wrong direction. BPI pass is going to scale these anyway. see https://github.com/llvm/llvm-project/blob/27d666b9adc12a13f67ccde96d5b4d181f56ffc5/llvm/lib/Analysis/BranchProbabilityInfo.cpp#L420 We should just do what they do, and since this is now a common functionality, we should extract that logic and make it reusable. `ProfDataUtils` is a good candidate for where that code should live, since this is about branch weights, but if there is somewhere else that makes sense, that is also fine. I'd also suggest making that a separate patch, since its kind of a code clean up that is orthogonal to this change.

Hi @paulkirth, I summarized all the issues about BFI and bring it up in discourse: https://discourse.llvm.org/t/is-that-blockfrequencyinfo-can-only-be-used-with-profile-data/64855. By David's response, I think I can start to try to add BFI in SimplifyCFG to see if it's a feasible way.

David is certainly an expert on this part of the compiler, so I expect you
to be successful using his advice.

I am also not entirely sure you need to go to those lengths though, as I
mentioned in my earlier comment. But if you find a straightforward
solution, I'll be happy to offer any help I can.

I did not look at the patch in details. At a high level, it seems that the pass itself does not use profile data to make decisions, so BPI/BFI is really not needed. Instead, directly updating the profile meta data as transformation happens is reasonable (unless BFI provides updating APIs for cases like this, which is not the case).

Regarding the duplication between JumpThreading and SimplifyCFG, this basically comes down to phase ordering. SimplifyCFG is run many times throughout the pipeline, while passes like JumpThreading run only once. For that reason, SimplifyCFG often implements some basic version of optimizations that have a more powerful implementation in a dedicated pass.

Regarding BFI/BPI, I don't think we want to use those in SimplifyCFG. I don't think we can reasonably preserve BFI though all SimplifyCFG transforms -- and I believe it also depends on DT and LI, which we don't want to have as SimplifyCFG dependencies at this time.

When I mentioned updateBlockFreqAndEdgeWeight() that was more as a pointer to code that should be doing something similar -- if it uses BFI, then we indeed shouldn't reuse it directly (but maybe there is some useful logic that can be shared -- I don't know). It is a bit odd that your code seems to be substantially more complicated than what JumpThreading does.

Anyway, I'm not really familiar with branch weight metadata, so these were really just some drive-by notes, someone who is familiar should review this.

Thank you all for the help!
@paulkirth, your earlier comments really helps and I'm improving the code now.
@davidxl, I also found it's difficult to add BFI in SimplifyCFG and may I ask could you please help me check the high level of this patch and whether this updating method makes sense?
@nikic, anyway thank you for the suggestions. I think it looks harder than JumpThreading because our final goal is to keep block frequency the same after CFG change. Jumpthreading knows BFI and need to update branch(just like already know the result and want to find the reason), but SimplifyCFG cannot use BFI so it's like doing something in a reverse manner. That's just my guess.

LukeZhuang removed a subscriber: nikic.Aug 29 2022, 3:18 PM

@davidxl Gentle ping :)

davidxl added inline comments.Sep 6 2022, 9:05 AM

llvm/test/Transforms/SimplifyCFG/fold-cond-bi-fix-weight.ll
12–13	The comment can be enhanced for reader to follow. Can you add an ascii art of the CFG after simplifyCFG
90	Similarly add a cfg description post transformation.
132	Can you split out the overflow fix in a different patch?
175	Add post simplfycfg CFG

Sorry for the late reply. @paulkirth @davidxl I just updated the patch according to your comments

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
2998	I think I used this name because I found some other functions also uses it. For example, https://github.com/llvm/llvm-project/blob/49e7ef2c09facd722a29a5ad96a7f8f16e362b28/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L1390, as well as #L1096. Yeah I agree that that might be confusing if reading the definition first before its usage. I added description to it. Thanks for pointing out
3007	Currently I already did this in uint64_t. I need to add the overflow handling because in the original formula, there will be a product of three uint32_t(totW * Weights[i] * BNTW) which will overflow even in uint64_t.
3012	As we previously discussed, https://github.com/llvm/llvm-project/blob/49e7ef2c09facd722a29a5ad96a7f8f16e362b28/llvm/lib/Analysis/BranchProbabilityInfo.cpp#L432 also asserts that
3171–3172	Because in the previous line of "BIHasWeights" I assumed that the "true successor" is BB->RealDest and "false successor" is the other one when calling extractBranchWeights. This line is used to swap them if "false successor" is actually BB->RealDest. Sorry for the confusing, I also added comments to avoid misunderstanding

Harbormaster completed remote builds in B187158: Diff 460769.Sep 16 2022, 8:26 AM

paulkirth added inline comments.Sep 22 2022, 10:09 AM

llvm/lib/Transforms/Utils/SimplifyCFG.cpp
1100	IMO this seems to be a common functionality that both BranchProbabilityInfo and SimplyfyCFG need. While IMO, https://github.com/llvm/llvm-project/blob/27d666b9adc12a13f67ccde96d5b4d181f56ffc5/llvm/lib/Analysis/BranchProbabilityInfo.cpp#L420 still seems like the right approach, whichever implementation we go with should be factored out into a common utility. That way all the parts of LLVM that need this functionality can benefit.
2998–3000	Is this an accurate way to rephrase this? It's how I read it, but it seems inconsistent w/ the comments + formula below.
3013	This seems to conflict w/ your description above. "... to be its original value times the probability of..." Implies to me that `X = BTW/TotW`. Either the description above is incomplete, or this isn't the right formula to use. Can we make these consistent?
3030–3032	Do you think this edit would make it easier to follow? Is it missing any important information?
3042	nit: I is now Idx, here and elsewhere in this description
3170–3171	nit: you don't need this variable at all, since you only use it in the conditional below. `if(extractBranchWeights(...))` is probably a bit more idiomatic.

davidxl added inline comments.Oct 5 2022, 9:08 AM

llvm/test/Transforms/SimplifyCFG/fold-cond-bi-fix-weight-overflow.ll
10 ↗	(On Diff #460769)	In this case, why does the prof meta data for the switch need to change in the first place? Can the original meta data be retained? Regarding overflow, should the branch weights be scaled down first to avoid the problem in the first place?
llvm/test/Transforms/SimplifyCFG/fold-cond-bi-fix-weight.ll
59	By looking at example only, it looks like the right way is to compute the branch weights for entry block using backward propagation. Say entry->lor.end has weights x, and sentry->lor.rhs has weights y. Assuming entry (and lor.end) has weights 100. The following system equation holds (assumes no path sensitivity, and thus the branch weights after transformation remains the same for lor.rhs), we have x+ y = 100 x + y/10 = 10 And it gives x = 0, y = 100 Please note that, due to the possibility of path sensitive behavior, there is no reason to believe that probability of lor.rhs without this patch is any worse than the one with the patch.

And @paulkirth, also thank you very much for the comments and I'm working on improving it

llvm/test/Transforms/SimplifyCFG/fold-cond-bi-fix-weight-overflow.ll
10 ↗	(On Diff #460769)	I created this test to not testing the feature itself but to catch a potential risk in my method that may overflow if I didn't handle it appropriately. I already scale down it in the code, but even so, I will still need to calculate the product three uint32_t, and I proved that the product of two of them will not exceed UINT_MAX. So this test is intended for this.
llvm/test/Transforms/SimplifyCFG/fold-cond-bi-fix-weight.ll
59	Hi David, thanks for the comments! I have a question on it. The backward propagation require us to know the block frequency info(or especially know the whole CFG), may I ask is that correct? Otherwise I think we cannot know lor.end has the same weight as entry, and if.then has 10% weights of entry. So I think that goes back to our previous discussion. The first issue is the transformation in SimplifyCFG does not know either BlockFrequencyInfo nor the whole CFG. The second issue is this branch weight maintenance can also happen with static code prediction itself(using __builtin_expect in my example), in which case we do not have frequency info from dynamic analysis such as PGO. Please correct me if I understand it incorrectly.

davidxl added inline comments.Oct 7 2022, 9:07 AM

llvm/test/Transforms/SimplifyCFG/fold-cond-bi-fix-weight.ll
59	The branch weights are relative numbers (to compute branch probability), comparison between weights between nodes are meaningless. BFI uses BPI information to compute block frequencies. Block frequencies are relative but at function level -- comparing block frequency across functions are not meaningful. With function entry count (real profile) available, block frequency can be converted to block count which is globally (whole program) valid and comparable. In other words, with a single entry/exit region, you should be able to propagate branch weights (to be more precise, to propagate taken/notaken ratio, or probability, not the raw weight values). You can assume entry node with any frequency for that matter. The key problem with back propagation is that it assumes the probability of lor.lhs does not change after the splitting. This assumption may not hold in reality. It is equally reasonable to assume a probability at entry node and do forward propagation.

Revision Contents

Path

Size

llvm/

lib/

Transforms/

Utils/

SimplifyCFG.cpp

131 lines

test/

Transforms/

SimplifyCFG/

fold-cond-bi-fix-weight.ll

236 lines

Diff 450365

llvm/lib/Transforms/Utils/SimplifyCFG.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 869 Lines • ▼ Show 20 Lines static void setBranchWeights(Instruction *I, uint32_t TrueWeight,

// nullptr to setMetadata which will erase the existing metadata. // nullptr to setMetadata which will erase the existing metadata.

MDNode *N = nullptr; MDNode *N = nullptr;

if (TrueWeight || FalseWeight) if (TrueWeight || FalseWeight)

N = MDBuilder(I->getParent()->getContext()) N = MDBuilder(I->getParent()->getContext())

.createBranchWeights(TrueWeight, FalseWeight); .createBranchWeights(TrueWeight, FalseWeight);

I->setMetadata(LLVMContext::MD_prof, N); I->setMetadata(LLVMContext::MD_prof, N);

} }

// A general interface to set branch_weights for SwitchInst/BranchInst/SelInst

static void setBranchWeights(Instruction *I, ArrayRef<uint32_t> Weights) {

paulkirthUnsubmitted

Done

I don't think you need this helper function.

the setBranchWeights(SwitchInst*,...) version should be able to handle these cases if you just modify the signature to take Instruction* instead and add the assert below.

The reason being is that CreateBranchWeights() only offers a convenient api for branch/select that takes two weights to avoid writing boilerplate at callstites. Internally it just calls the version of CreateBranchWeights() that takes an ArrayRef.

In this case, I think in this case it’s better to follow the same approach and avoid introducing a redundant function implementation.

paulkirth: I don't think you need this helper function. the `setBranchWeights(SwitchInst*,...)` version…

assert(isa<SwitchInst>(I) || isa<BranchInst>(I) || isa<SelectInst>(I));

if (isa<SwitchInst>(I)) {

SwitchInst *SI = dyn_cast<SwitchInst>(I);

setBranchWeights(SI, Weights);

} else {

assert(Weights.size() == 2);

setBranchWeights(I, Weights[0], Weights[1]);

}

/// If TI is known to be a terminator instruction and its block is known to /// If TI is known to be a terminator instruction and its block is known to

/// only have a single predecessor block, check to see if that predecessor is /// only have a single predecessor block, check to see if that predecessor is

/// also a value comparison with the same value, and if that comparison /// also a value comparison with the same value, and if that comparison

/// determines the outcome of this comparison. If so, simplify TI. This does a /// determines the outcome of this comparison. If so, simplify TI. This does a

/// very limited form of jump threading. /// very limited form of jump threading.

bool SimplifyCFGOpt::SimplifyEqualityComparisonWithOnlyPredecessor( bool SimplifyCFGOpt::SimplifyEqualityComparisonWithOnlyPredecessor(

Instruction *TI, BasicBlock *Pred, IRBuilder<> &Builder) { Instruction *TI, BasicBlock *Pred, IRBuilder<> &Builder) {

Value *PredVal = isValueEqualityComparison(Pred->getTerminator()); Value *PredVal = isValueEqualityComparison(Pred->getTerminator());

▲ Show 20 Lines • Show All 193 Lines • ▼ Show 20 Lines static void FitWeights(MutableArrayRef<uint64_t> Weights) {

uint64_t Max = *std::max_element(Weights.begin(), Weights.end()); uint64_t Max = *std::max_element(Weights.begin(), Weights.end());

if (Max > UINT_MAX) { if (Max > UINT_MAX) {

unsigned Offset = 32 - countLeadingZeros(Max); unsigned Offset = 32 - countLeadingZeros(Max);

for (uint64_t &I : Weights) for (uint64_t &I : Weights)

I >>= Offset; I >>= Offset;

} }

/// Keep halving the weights until the sum of weights can fit in uint32_t.

RKSimonUnsubmitted

Done

'Keep halving' suggests there should be a while loop?

RKSimon: 'Keep halving' suggests there should be a while loop?

LukeZhuangAuthorUnsubmitted

Done

Oh, this function and comment is copied from the FitWeights above. Yeah I agree that it somewhat does not match with comments. I prefer to change comment to like "Fit weight so that ..."

LukeZhuang: Oh, this function and comment is copied from the FitWeights above. Yeah I agree that it…

static void FitTotalWeights(MutableArrayRef<uint64_t> Weights) {

paulkirthUnsubmitted

Done

I’m fairly certain we don’t want to scale the weights down so that the sum of all weights fits in a uint32.

For one there are all kinds of things that assume they can be larger. Off the top of my head, Optimization Remarks are often filtered by hotness (i.e. branch weight). But IIRC there are also transforms that only trigger when the weights exceed some threshold. Those may be better served by using a probability, but my recollection is that they do not. Though on this point someone w/ more comprehensive knowledge in this area should probably weigh in. Still, I'm skeptical that we want to do this.

I'm a bit surprised that scaling weights down like this does not affect any tests. That may be highlighting some untested paths more than that this change in behavior is doing the right thing.

paulkirth: I’m fairly certain we don’t want to scale the weights down so that the sum of all weights fits…

LukeZhuangAuthorUnsubmitted

Done

Hi, I have another question about this, in D10979 which I mentioned previously, the author wrote something similar: normalizeEdgeWeights which also scale the sum to be within UINT32_MAX. But later in D15519 he changed all of weight to probability. I guess he had changed to this because of probably the same reason as you mentioned, and with probability we do not need to care about integer overflow.
However, currently no other place in SimplifyCFG is doing CFG fix using probability, all of them are integer weight calculation. The most typical one is performBranchToCommonDestFolding. So I'm concerned about consistency, may I ask is that okay to do probability calculation here?

LukeZhuang: Hi, I have another question about this, in D10979 which I mentioned previously, the author…

paulkirthUnsubmitted

Done

Looks like I may have steered you in the wrong direction. BPI pass is going to scale these anyway. see https://github.com/llvm/llvm-project/blob/27d666b9adc12a13f67ccde96d5b4d181f56ffc5/llvm/lib/Analysis/BranchProbabilityInfo.cpp#L420

We should just do what they do, and since this is now a common functionality, we should extract that logic and make it reusable. ProfDataUtils is a good candidate for where that code should live, since this is about branch weights, but if there is somewhere else that makes sense, that is also fine. I'd also suggest making that a separate patch, since its kind of a code clean up that is orthogonal to this change.

paulkirth: Looks like I may have steered you in the wrong direction. BPI pass is going to scale these…

paulkirthUnsubmitted

Not Done

IMO this seems to be a common functionality that both BranchProbabilityInfo and SimplyfyCFG need. While IMO, https://github.com/llvm/llvm-project/blob/27d666b9adc12a13f67ccde96d5b4d181f56ffc5/llvm/lib/Analysis/BranchProbabilityInfo.cpp#L420 still seems like the right approach, whichever implementation we go with should be factored out into a common utility. That way all the parts of LLVM that need this functionality can benefit.

paulkirth: IMO this seems to be a common functionality that both BranchProbabilityInfo and SimplyfyCFG…

uint64_t Sum = std::accumulate(Weights.begin(), Weights.end(), (uint64_t)0);

if (Sum > UINT_MAX) {

unsigned Offset = 32 - countLeadingZeros(Sum);

for (uint64_t &I : Weights)

I >>= Offset;

}

static void CloneInstructionsIntoPredecessorBlockAndUpdateSSAUses( static void CloneInstructionsIntoPredecessorBlockAndUpdateSSAUses(

BasicBlock *BB, BasicBlock *PredBlock, ValueToValueMapTy &VMap) { BasicBlock *BB, BasicBlock *PredBlock, ValueToValueMapTy &VMap) {

Instruction *PTI = PredBlock->getTerminator(); Instruction *PTI = PredBlock->getTerminator();

// If we have bonus instructions, clone them into the predecessor block. // If we have bonus instructions, clone them into the predecessor block.

// Note that there may be multiple predecessor blocks, so we cannot move // Note that there may be multiple predecessor blocks, so we cannot move

// bonus instructions to a predecessor block. // bonus instructions to a predecessor block.

for (Instruction &BonusInst : *BB) { for (Instruction &BonusInst : *BB) {

▲ Show 20 Lines • Show All 1,873 Lines • ▼ Show 20 Lines static ConstantInt *getKnownValueOnEdge(Value *V, BasicBlock *From,

if (BI && BI->isConditional() && BI->getCondition() == V && if (BI && BI->isConditional() && BI->getCondition() == V &&

BI->getSuccessor(0) != BI->getSuccessor(1)) BI->getSuccessor(0) != BI->getSuccessor(1))

return BI->getSuccessor(0) == To ? ConstantInt::getTrue(BI->getContext()) return BI->getSuccessor(0) == To ? ConstantInt::getTrue(BI->getContext())

: ConstantInt::getFalse(BI->getContext()); : ConstantInt::getFalse(BI->getContext());

return nullptr; return nullptr;

} }

/// We want to make the new branch probability of PTI taken(from PredBB to BB)

paulkirthUnsubmitted

Done

What is PTI? Is this a common term from somewhere? If it's the predecessor, then Pred or PredInst seems to communicate more.

paulkirth: What is PTI? Is this a common term from somewhere? If it's the predecessor, then `Pred` or…

LukeZhuangAuthorUnsubmitted

Done

I think I used this name because I found some other functions also uses it. For example, https://github.com/llvm/llvm-project/blob/49e7ef2c09facd722a29a5ad96a7f8f16e362b28/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L1390, as well as #L1096. Yeah I agree that that might be confusing if reading the definition first before its usage. I added description to it. Thanks for pointing out

LukeZhuang: I think I used this name because I found some other functions also uses it. For example, https…

/// to be original one times the probability of BI taken(from BB to Dest),

/// one simple way is to scale all the PTI not_taken branch weight by a factor

paulkirthUnsubmitted

Not Done

return nullptr;

}

- /// We want to update the probability of all the branches of "PredBB Taken"

- /// (PredBB->BB) to be its original value times the probability of "BB

- /// Taken" (BB->Dest). The CFG is as follows:

+ /// We want to update the branch weight for branches taken from PredBB->BB

+ /// to match the updated control flow. The new weight should be:

+ ///

+ /// UpdatedWeight = OldWeight * Prob

+ ///

+ /// where:

+ /// OldWeight = Weight from PredBB->BB

+ /// Prob = Probability of taking BB->Dest in the old CFG

+ ///

+ /// The CFG is as follows:

///

/// PredBB

Is this an accurate way to rephrase this? It's how I read it, but it seems inconsistent w/ the comments + formula below.

paulkirth: Is this an accurate way to rephrase this? It's how I read it, but it seems inconsistent w/ the…

/// X. By calculation, X should be the formula below:

paulkirthUnsubmitted

Done

Can you rephrase this? I'd like this to be clear to all readers, and it's currently a bit hard to parse.

paulkirth: Can you rephrase this? I'd like this to be clear to all readers, and it's currently a bit hard…

/// X = 1 + (BNTW/BTW) * (totW/totNTW)

/// in which "BNTW" is BINotTakenWeight and "BTW" is BITakenWeight, "totW" is

/// the sum of branch weights of PTI, "totNTW" is the sum of all PTI not_taken

/// branch weight(which is totW minus sum of PTI taken branches)

paulkirthUnsubmitted

Done

I'd suggest avoiding variable names from the implementation here where you're documenting your algorithm and approach.

Can you also break each variable def/description onto its own line to make reading a bit easier?

paulkirth: I'd suggest avoiding variable names from the implementation here where you're documenting your…

/// The value will not overflow if totW is in uint32_t range(see details below)

static void ScaleWeights(uint64_t BITakenWeight, uint64_t BINotTakenWeight,

paulkirthUnsubmitted

Not Done

This function seems to go to a lot of trouble to avoid overflowing uint32_t, which is good, but it seems like things would be significantly simpler if you did the calculations with uint64_t and then scaled the weights the same way we do elsewhere. If that is common functionality, then it can be moved to elsewhere, like IR/ProfDataUtils.

Are there other reasons other reasons, concerns, or context here that I'm missing?

paulkirth: This function seems to go to a lot of trouble to avoid overflowing `uint32_t`, which is good…

LukeZhuangAuthorUnsubmitted

Done

Currently I already did this in uint64_t. I need to add the overflow handling because in the original formula, there will be a product of three uint32_t(totW * Weights[i] * BNTW) which will overflow even in uint64_t.

LukeZhuang: Currently I already did this in uint64_t. I need to add the overflow handling because in the…

SmallVector<uint64_t, 8> &PTIWeights,

const SmallDenseSet<unsigned, 4> &PTITakenIndexes) {

uint64_t PTITotalWeight =

std::accumulate(PTIWeights.begin(), PTIWeights.end(), (uint64_t)0);

assert(PTITotalWeight <= UINT32_MAX);

paulkirthUnsubmitted

Not Done

I don't think it's correct to assert this, is it? Weights are allowed to be 32-bits, and summing them could overflow uint32_t, and is an expected possibility (see ProfDataUtils.cpp). So why is this case different?

paulkirth: I don't think it's correct to assert this, is it? Weights are allowed to be 32-bits, and…

LukeZhuangAuthorUnsubmitted

Done

As we previously discussed, https://github.com/llvm/llvm-project/blob/49e7ef2c09facd722a29a5ad96a7f8f16e362b28/llvm/lib/Analysis/BranchProbabilityInfo.cpp#L432 also asserts that

LukeZhuang: As we previously discussed, https://github.com/llvm/llvm…

uint64_t PTITotalTakenWeight = 0;

paulkirthUnsubmitted

Not Done

This seems to conflict w/ your description above. "... to be its original value times the probability of..." Implies to me that X = BTW/TotW.

Either the description above is incomplete, or this isn't the right formula to use. Can we make these consistent?

paulkirth: This seems to conflict w/ your description above. "... to be its original value times the…

for (unsigned I : PTITakenIndexes)

PTITotalTakenWeight += PTIWeights[I];

for (unsigned I = 0, E = PTIWeights.size(); I != E; I++) {

paulkirthUnsubmitted

Done

Can you rename this Idx, or something to more clearly indicate this is an index? It's extremely common in our codebase to use I for instructions, and in this case I'm finding myself reading the code incorrectly.

This compounded below when you say:

// Branch "I" is also one of the PTI not_taken branch and that means

paulkirth: Can you rename this `Idx`, or something to more clearly indicate this is an index? It's…

if (PTITakenIndexes.find(I) != PTITakenIndexes.end())

continue;

// Small adjustment by doing add below,

// so newWeights[I] = Weights[I] + (totW*Weights[I]/totNTW) * (BNTW/BTW)

paulkirthUnsubmitted

Done

This is significantly easier to follow than the code below. Why not use these names?

paulkirth: This is significantly easier to follow than the code below. Why not use these names?

// Branch "I" is also one of the PTI not_taken branch and that means

// Weights[I]/totNTW <= 1, thus Inc is at most totW and at most UINT_MAX.

// Add max(1,w) because nothing prevents weight to be 0

uint64_t Inc = PTITotalWeight * PTIWeights[I] /

paulkirthUnsubmitted

Done

If you need an intermediate variable, please call out what it is in your equations.

We can all figure it out, but it helps to be either obvious from the naming, or to note in the comment that you're calculating (totW*Weights[I]/totNTW).

paulkirth: If you need an intermediate variable, please call out what it is in your equations. We can…

std::max(1UL, (PTITotalWeight - PTITotalTakenWeight));

// The Inc result here is at most UINT32_MAX ^ 2

Inc = (Inc * BINotTakenWeight) / std::max(1UL, BITakenWeight);

paulkirthUnsubmitted

Not Done

uint64_t TotTakenWeights = 0;

- // PTITakenIndexes stores the indexes of BB among all the successors of

- // PredBB, which is also the indexes of weights that corresponding to

- // PredBB->BB in the PredBB Terminator Instruction's weight array

+ // Weights: branch weights for PredBB.

+ // TakenIndexes: the set of all indexes in Weights corresponding to a path from PredBB->BB

for (unsigned Idx : PTITakenIndexes)

Do you think this edit would make it easier to follow? Is it missing any important information?

paulkirth: Do you think this edit would make it easier to follow? Is it missing any important information?

// UINT64_MAX = UINT32_MAX ^ 2 + 2 * UINT32_MAX, and here Weights[I] is

// at most UINT32_MAX ^ 2 + UINT32_MAX and thus will not exceed UINT64_MAX

PTIWeights[I] += Inc;

}

static void FixupBranchWeight(BasicBlock *BB,

const ArrayRef<BasicBlock *> &PredBBs,

uint64_t BITakenWeight,

uint64_t BINotTakenWeight) {

paulkirthUnsubmitted

Not Done

continue;

- // Now considering the branch indexed by "I". This branch is from PredBB

+ // Now considering the branch indexed by "Idx". This branch is from PredBB

// but not to BB.

nit: I is now Idx, here and elsewhere in this description

paulkirth: nit: I is now Idx, here and elsewhere in this description

for (unsigned i = 0, e = PredBBs.size(); i != e; ++i) {

BasicBlock *PredBB = PredBBs[i];

Instruction *PTI = PredBB->getTerminator();

paulkirthUnsubmitted

Done

OK, so PTI is the Predecessor's Terminator Inst... I still think we need to name some of this differently. PTI is probably fine for reffing the terminator itself when you care about it being a terminator. But in most cases here you're really talking about incoming branches and their weights. PTIWeights doesn't communicate that IMO.

paulkirth: OK, so PTI is the Predecessor's Terminator Inst... I still think we need to name some of this…

// We only need to fix branch weights for BranchInst/SwitchInst

SmallVector<uint64_t, 8> PTIWeights;

if (BranchInst *PBI = dyn_cast<BranchInst>(PTI)) {

if (PBI->isUnconditional())

continue;

uint64_t PBITrue, PBIFalse;

bool PBIHasWeights = PBI->extractProfMetadata(PBITrue, PBIFalse);

paulkirthUnsubmitted

Done

We have APIs for checking if an instruction has branch weight metadata.

see https://github.com/llvm/llvm-project/blob/a7441289e2eb8ccf71c1a60b71d898069e82b22b/llvm/lib/IR/ProfDataUtils.cpp#L99

paulkirth: We have APIs for checking if an instruction has branch weight metadata. see https://github.

if (!PBIHasWeights)

PBITrue = PBIFalse = 1;

PTIWeights.push_back(PBITrue);

PTIWeights.push_back(PBIFalse);

} else if (SwitchInst *PSI = dyn_cast<SwitchInst>(PTI)) {

if (HasBranchWeights(PSI))

GetBranchWeights(PSI, PTIWeights);

else

PTIWeights.assign(1 + PSI->getNumCases(), 1);

} else

paulkirthUnsubmitted

Done

can't you simplify this significantly by treating branch and switch uniformly? We have APIs in ProfDataUtils just for that...

You should be able to avoid most of the dyn_casts and the different conditions almost entirely, right?

https://github.com/llvm/llvm-project/blob/a7441289e2eb8ccf71c1a60b71d898069e82b22b/llvm/lib/IR/ProfDataUtils.cpp#L111

Assigning default weights can also be made uniform.

paulkirth: can't you simplify this significantly by treating branch and switch uniformly? We have APIs in…

continue;

assert(PTIWeights.size() == PTI->getNumSuccessors());

SmallVector<uint64_t, 8> PTIOriginalWeights = PTIWeights;

FitTotalWeights(PTIWeights); // required by ScaleWeights above

SmallDenseSet<unsigned, 4> PTITakenIndexes;

for (unsigned I = 0, E = PTIWeights.size(); I != E; I++)

if (PTI->getSuccessor(I) == BB)

PTITakenIndexes.insert(I);

assert(!PTITakenIndexes.empty());

ScaleWeights(BITakenWeight, BINotTakenWeight, PTIWeights, PTITakenIndexes);

FitWeights(PTIWeights);

if (PTIWeights != PTIOriginalWeights) {

// only set when the new weights is different from original one

SmallVector<uint32_t, 8> PTIMDWeights(PTIWeights.begin(),

PTIWeights.end());

setBranchWeights(PTI, PTIMDWeights);

}

/// If we have a conditional branch on something for which we know the constant /// If we have a conditional branch on something for which we know the constant

/// value in predecessors (e.g. a phi node in the current block), thread edges /// value in predecessors (e.g. a phi node in the current block), thread edges

/// from the predecessor to their ultimate destination. /// from the predecessor to their ultimate destination.

static Optional<bool> static Optional<bool>

FoldCondBranchOnValueKnownInPredecessorImpl(BranchInst *BI, DomTreeUpdater *DTU, FoldCondBranchOnValueKnownInPredecessorImpl(BranchInst *BI, DomTreeUpdater *DTU,

const DataLayout &DL, const DataLayout &DL,

AssumptionCache *AC) { AssumptionCache *AC) {

SmallMapVector<ConstantInt *, SmallSetVector<BasicBlock *, 2>, 2> KnownValues; SmallMapVector<ConstantInt *, SmallSetVector<BasicBlock *, 2>, 2> KnownValues;

▲ Show 20 Lines • Show All 53 Lines • ▼ Show 20 Lines for (const auto &Pair : KnownValues) {

// Split the predecessors we are threading into a new edge block. We'll // Split the predecessors we are threading into a new edge block. We'll

// clone the instructions into this block, and then redirect it to RealDest. // clone the instructions into this block, and then redirect it to RealDest.

BasicBlock *EdgeBB = SplitBlockPredecessors(BB, PredBBs, ".critedge", DTU); BasicBlock *EdgeBB = SplitBlockPredecessors(BB, PredBBs, ".critedge", DTU);

// TODO: These just exist to reduce test diff, we can drop them if we like. // TODO: These just exist to reduce test diff, we can drop them if we like.

EdgeBB->setName(RealDest->getName() + ".critedge"); EdgeBB->setName(RealDest->getName() + ".critedge");

EdgeBB->moveBefore(RealDest); EdgeBB->moveBefore(RealDest);

// After redirecting PredBB to RealDest, we also need to update the branch

// weight of PTI(the terminator of PredBB).

// Let's use P(A->B) to represent the branch probability from block A to

// block B, and BlockFreq(A) to represent the block frequency of block A.

// So that the new P(PredBB->RealDest) = P(PredBB->BB) * P(BB->RealDest)

// Because original block frequency of RealDest is

// BlockFreq(RealDest)

// = BlockFreq(BB) * P(BB->RealDest) + eps1

// = (BlockFreq(PredBB) * P(PredBB->BB) + eps2) * P(BB->RealDest) + eps1

// = BlockFreq(PredBB) * P(PredBB->BB) * P(BB->RealDest) + eps

paulkirthUnsubmitted

Done

// Because original block frequency of RealDest is

- // BlockFreq(RealDest)

- // = BlockFreq(BB) * P(BB->RealDest) + eps1

- // = (BlockFreq(PredBB) * P(PredBB->BB) + eps2) * P(BB->RealDest) + eps1

- // = BlockFreq(PredBB) * P(PredBB->BB) * P(BB->RealDest) + eps

+ // BlockFreq(RealDest) = BlockFreq(BB) * P(BB->RealDest) + eps1

+ // = (BlockFreq(PredBB) * P(PredBB->BB) + eps2) * P(BB->RealDest) + eps1

+ // = BlockFreq(PredBB) * P(PredBB->BB) * P(BB->RealDest) + eps

// Assuming that there are no other paths from PredBB to RealDest

paulkirth:

paulkirthUnsubmitted

Done

// Because original block frequency of RealDest is

// BlockFreq(RealDest)

- // = BlockFreq(BB) * P(BB->RealDest) + eps1

- // = (BlockFreq(PredBB) * P(PredBB->BB) + eps2) * P(BB->RealDest) + eps1

- // = BlockFreq(PredBB) * P(PredBB->BB) * P(BB->RealDest) + eps

+ // = BlockFreq(BB) * P(BB->RealDest) + eps1

+ // = (BlockFreq(PredBB) * P(PredBB->BB) + eps2) * P(BB->RealDest) + eps1

+ // = BlockFreq(PredBB) * P(PredBB->BB) * P(BB->RealDest) + eps

// Assuming that there are no other paths from PredBB to RealDest

paulkirth:

// Assuming that there are no other paths from PredBB to RealDest

uint64_t BITakenWeight, BINotTakenWeight;

bool BIHasWeights =

BI->extractProfMetadata(BITakenWeight, BINotTakenWeight);

// Only update when BI has branch weights

if (BIHasWeights) {

if (!CB->getZExtValue())

paulkirthUnsubmitted

Not Done

nit: you don't need this variable at all, since you only use it in the conditional below.

if(extractBranchWeights(...)) is probably a bit more idiomatic.

paulkirth: nit: you don't need this variable at all, since you only use it in the conditional below. `if…

std::swap(BITakenWeight, BINotTakenWeight);

paulkirthUnsubmitted

Done

Are we inverting the branch condition? why are we swapping? I think I've missed something here...

paulkirth: Are we inverting the branch condition? why are we swapping? I think I've missed something here..

LukeZhuangAuthorUnsubmitted

Done

Because in the previous line of "BIHasWeights" I assumed that the "true successor" is BB->RealDest and "false successor" is the other one when calling extractBranchWeights. This line is used to swap them if "false successor" is actually BB->RealDest.
Sorry for the confusing, I also added comments to avoid misunderstanding

LukeZhuang: Because in the previous line of "BIHasWeights" I assumed that the "true successor" is BB…

FixupBranchWeight(EdgeBB, PredBBs, BITakenWeight, BINotTakenWeight);

}

// Update PHI nodes. // Update PHI nodes.

AddPredecessorToBlock(RealDest, EdgeBB, BB); AddPredecessorToBlock(RealDest, EdgeBB, BB);

// BB may have instructions that are being threaded over. Clone these // BB may have instructions that are being threaded over. Clone these

// instructions into EdgeBB. We know that there will be no uses of the // instructions into EdgeBB. We know that there will be no uses of the

// cloned instructions outside of EdgeBB. // cloned instructions outside of EdgeBB.

BasicBlock::iterator InsertPt = EdgeBB->getFirstInsertionPt(); BasicBlock::iterator InsertPt = EdgeBB->getFirstInsertionPt();

DenseMap<Value *, Value *> TranslateMap; // Track translated values. DenseMap<Value *, Value *> TranslateMap; // Track translated values.

▲ Show 20 Lines • Show All 4,117 Lines • Show Last 20 Lines

llvm/test/Transforms/SimplifyCFG/fold-cond-bi-fix-weight.ll

This file was added.

				; RUN: opt -S -simplifycfg < %s \| FileCheck %s

				declare void @foo()
				declare void @foo1()
				declare void @foo2()
				declare void @foo3()

				;; This test is for fixing branch_weights in FoldCondBranchOnValueKnownInPredecessor

				;; Testing cases when PredBB has no branchweight(all are initialised as 1)
				;; Actual probability of hitting foo is 1/10(10%)
				;; Before fix, the probability of hitting foo is 11/20(55%)
				;; After fix, the probability of hitting foo is 29/200(14.5%)
				paulkirthUnsubmitted Done Reply Inline Actions I don't know if these comments(here and elsewhere) make sense outside of this patch, since 'Before' and 'After' are only relevant in the context of your change... I think if you want to more thoroughly explain the purpose of the test, you'll need to provide context. A summary of the incorrect behavior and what a regression looks like is probably enough. You can also ref this patch, and anyone coming along later can easily understand why this test exists, and what it's checking. WDYT? paulkirth: I don't know if these comments(here and elsewhere) make sense outside of this patch, since…
				davidxlUnsubmitted Done Reply Inline Actions The comment can be enhanced for reader to follow. Can you add an ascii art of the CFG after simplifyCFG davidxl: The comment can be enhanced for reader to follow. Can you add an ascii art of the CFG after…
				define void @branchinst_noweight(i8 %a0, i8 %a1, i8 %a2) {
				; CHECK-LABEL: @branchinst_noweight(
				; CHECK-NEXT: entry:
				; CHECK: br i1 [[COND:%.*]], label %if.then, label %if.end, !prof [[PROF0:![0-9]+]]
				; CHECK: if.then:
				; CHECK-NEXT: call void @foo()
				; CHECK-NEXT: br label %if.end
				; CHECK: if.end:
				; CHECK-NEXT: ret void
				entry:
				%and0 = and i8 %a0, 127
				%c0 = icmp eq i8 %and0, 1
				%and1 = and i8 %a1, 127
				%c1 = icmp eq i8 %and1, 1
				%or = or i1 %c0, %c1
				br i1 %or, label %lor.end, label %lor.rhs

				lor.rhs:
				%and2 = and i8 %a2, 127
				%c2 = icmp eq i8 %and2, 0
				br label %lor.end

				lor.end:
				%c3 = phi i1 [ true, %entry ], [ %c2, %lor.rhs ]
				br i1 %c3, label %if.then, label %if.end, !prof !0

				if.then:
				call void @foo()
				br label %if.end

				if.end:
				ret void
				}

				;; Testing cases when PredBB already has branch weight
				;; Actual probability of hitting foo is 1/10(10%)
				;; Before fix, the probability of hitting foo is 73/100(73%)
				;; After fix, the probability of hitting foo is 163/1000(16.3%)
				define void @branchinst_withweight(i8 %a0, i8 %a1, i8 %a2) {
				; CHECK-LABEL: @branchinst_withweight(
				; CHECK-NEXT: entry:
				; CHECK: br i1 [[COND:%.*]], label %if.then, label %if.end, !prof [[PROF1:![0-9]+]]
				; CHECK: if.then:
				; CHECK-NEXT: call void @foo()
				; CHECK-NEXT: br label %if.end
				; CHECK: if.end:
				davidxlUnsubmitted Not Done Reply Inline Actions By looking at example only, it looks like the right way is to compute the branch weights for entry block using backward propagation. Say entry->lor.end has weights x, and sentry->lor.rhs has weights y. Assuming entry (and lor.end) has weights 100. The following system equation holds (assumes no path sensitivity, and thus the branch weights after transformation remains the same for lor.rhs), we have x+ y = 100 x + y/10 = 10 And it gives x = 0, y = 100 Please note that, due to the possibility of path sensitive behavior, there is no reason to believe that probability of lor.rhs without this patch is any worse than the one with the patch. davidxl: By looking at example only, it looks like the right way is to compute the branch weights for…
				LukeZhuangAuthorUnsubmitted Not Done Reply Inline Actions Hi David, thanks for the comments! I have a question on it. The backward propagation require us to know the block frequency info(or especially know the whole CFG), may I ask is that correct? Otherwise I think we cannot know lor.end has the same weight as entry, and if.then has 10% weights of entry. So I think that goes back to our previous discussion. The first issue is the transformation in SimplifyCFG does not know either BlockFrequencyInfo nor the whole CFG. The second issue is this branch weight maintenance can also happen with static code prediction itself(using __builtin_expect in my example), in which case we do not have frequency info from dynamic analysis such as PGO. Please correct me if I understand it incorrectly. LukeZhuang: Hi David, thanks for the comments! I have a question on it. The backward propagation require us…
				davidxlUnsubmitted Not Done Reply Inline Actions The branch weights are relative numbers (to compute branch probability), comparison between weights between nodes are meaningless. BFI uses BPI information to compute block frequencies. Block frequencies are relative but at function level -- comparing block frequency across functions are not meaningful. With function entry count (real profile) available, block frequency can be converted to block count which is globally (whole program) valid and comparable. In other words, with a single entry/exit region, you should be able to propagate branch weights (to be more precise, to propagate taken/notaken ratio, or probability, not the raw weight values). You can assume entry node with any frequency for that matter. The key problem with back propagation is that it assumes the probability of lor.lhs does not change after the splitting. This assumption may not hold in reality. It is equally reasonable to assume a probability at entry node and do forward propagation. davidxl: The branch weights are relative numbers (to compute branch probability), comparison between…
				; CHECK-NEXT: ret void
				entry:
				%and0 = and i8 %a0, 127
				%c0 = icmp eq i8 %and0, 1
				%and1 = and i8 %a1, 127
				%c1 = icmp eq i8 %and1, 1
				%or = or i1 %c0, %c1
				br i1 %or, label %lor.end, label %lor.rhs, !prof !1

				lor.rhs:
				%and2 = and i8 %a2, 127
				%c2 = icmp eq i8 %and2, 0
				br label %lor.end

				lor.end:
				%c3 = phi i1 [ true, %entry ], [ %c2, %lor.rhs ]
				br i1 %c3, label %if.then, label %if.end, !prof !0

				if.then:
				call void @foo()
				br label %if.end

				if.end:
				ret void
				}


				;; Testing cases when the terminator of PredBB is SwitchInst
				;; Actual probability of hitting foo is 9/10(90%)
				;; Before fix, the probability of hitting foo is 4/100(4%)
				;; After fix, the probability of hitting foo is 904/1000(90.4%)
				davidxlUnsubmitted Done Reply Inline Actions Similarly add a cfg description post transformation. davidxl: Similarly add a cfg description post transformation.
				define void @switch_test(i32 %a) {
				; CHECK-LABEL: @switch_test(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: switch i32 [[SA1:%.*]], label %end [
				; CHECK-NEXT: i32 2, label %mid
				; CHECK-NEXT: i32 4, label %mid
				; CHECK-NEXT: i32 64, label %mid
				; CHECK-NEXT: i32 128, label %mid
				; CHECK-NEXT: ], !prof [[PROF2:![0-9]+]]
				; CHECK: mid:
				; CHECK-NEXT: call void @foo()
				; CHECK-NEXT: br label %end
				; CHECK: end:
				; CHECK-NEXT: ret void
				entry:
				switch i32 %a, label %key [
				i32 2, label %mid
				i32 4, label %mid
				i32 64, label %mid
				i32 128, label %mid
				], !prof !2

				mid:
				br label %key

				key:
				%c = phi i1 [ true, %entry ], [ false, %mid ]
				br i1 %c, label %end, label %func, !prof !0

				func:
				call void @foo()
				br label %end

				end:
				ret void
				}


				;; Create a test case that will definitely overflow when doing naive add/mul,
				;; by making branch_weights which used in the calculation to be UINT32_MAX.
				;; After calculation, the probability of hitting foo is roughly the same.
				define void @overflow_test(i32 %a) {
				davidxlUnsubmitted Done Reply Inline Actions Can you split out the overflow fix in a different patch? davidxl: Can you split out the overflow fix in a different patch?
				; CHECK-LABEL: @overflow_test(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: switch i32 [[SA2:%.*]], label %end [
				; CHECK-NEXT: i32 2, label %mid
				; CHECK-NEXT: i32 4, label %mid
				; CHECK-NEXT: i32 64, label %mid
				; CHECK-NEXT: i32 128, label %mid
				; CHECK-NEXT: ], !prof [[PROF3:![0-9]+]]
				; CHECK: mid:
				; CHECK-NEXT: call void @foo()
				; CHECK-NEXT: br label %end
				; CHECK: end:
				; CHECK-NEXT: ret void
				entry:
				switch i32 %a, label %key [
				i32 2, label %mid
				i32 4, label %mid
				i32 64, label %mid
				i32 128, label %mid
				], !prof !3

				mid:
				br label %key

				key:
				%c = phi i1 [ true, %entry ], [ false, %mid ]
				br i1 %c, label %end, label %func, !prof !4

				func:
				call void @foo()
				br label %end

				end:
				ret void
				}


				;; Create a little bit mixed test case
				;; original probability is:
				;; func1: 100%, func2: 4%, func3: 8.96%
				;; before fix, they are: func1: 100%, func2: 30%, func3: 50%
				;; after fix, they are: func1: 100%, func2: 1.4%, func3: 1.5%
				define void @more_complex_test(i32 %a) {
				davidxlUnsubmitted Done Reply Inline Actions Add post simplfycfg CFG davidxl: Add post simplfycfg CFG
				; CHECK-LABEL: @more_complex_test(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: switch i32 [[SA3:%.*]], label %end [
				; CHECK-NEXT: i32 2, label %[[FUNC2:.*]]
				; CHECK-NEXT: i32 4, label %[[FUNC3:.*]]
				; CHECK-NEXT: ], !prof [[PROF4:![0-9]+]]
				; CHECK: [[FUNC2]]:
				; CHECK-NEXT: call void @foo2()
				; CHECK-NEXT: br label %end
				; CHECK: [[FUNC3]]:
				; CHECK-NEXT: call void @foo3()
				; CHECK-NEXT: br label %end
				; CHECK: end:
				; CHECK-NEXT: call void @foo1()
				; CHECK-NEXT: ret void
				entry:
				switch i32 %a, label %end [
				i32 2, label %mid1
				i32 4, label %mid2
				i32 8, label %mid3
				], !prof !5

				mid3:
				br label %mid1

				mid1:
				%c0 = phi i1 [ true, %entry ], [ false, %mid3 ]
				br i1 %c0, label %func2, label %mid2, !prof !6

				func2:
				call void @foo2()
				br label %end

				mid2:
				%c1 = phi i1 [ true, %entry ], [ false, %mid1 ]
				br i1 %c1, label %func3, label %end, !prof !6

				func3:
				call void @foo3()
				br label %end

				end:
				call void @foo1()
				ret void
				}


				!0 = !{!"branch_weights", i32 1, i32 9}
				!1 = !{!"branch_weights", i32 7, i32 3}
				!2 = !{!"branch_weights", i32 96, i32 1, i32 1, i32 1, i32 1}
				!3 = !{!"branch_weights", i32 1, i32 -1, i32 -1, i32 -1, i32 -1}
				!4 = !{!"branch_weights", i32 1, i32 -1}
				!5 = !{!"branch_weights", i32 1, i32 3, i32 5, i32 1}
				!6 = !{!"branch_weights", i32 1, i32 99}


				; CHECK: [[PROF0]] = !{!"branch_weights", i32 29, i32 171}
				; CHECK: [[PROF1]] = !{!"branch_weights", i32 163, i32 837}
				; CHECK: [[PROF2]] = !{!"branch_weights", i32 96, i32 226, i32 226, i32 226, i32 226}
				; CHECK: [[PROF3]] = !{!"branch_weights", i32 0, i32 -4, i32 -4, i32 -4, i32 -4}
				; CHECK: [[PROF4]] = !{!"branch_weights", i32 87914, i32 1290, i32 698}

This is an archive of the discontinued LLVM Phabricator instance.

Fix branch weight in FoldCondBranchOnValueKnownInPredecessor pass in SimplifyCFGNeeds ReviewPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 450365

llvm/lib/Transforms/Utils/SimplifyCFG.cpp

llvm/test/Transforms/SimplifyCFG/fold-cond-bi-fix-weight.ll

Fix branch weight in FoldCondBranchOnValueKnownInPredecessor pass in SimplifyCFG
Needs ReviewPublic