When choosing the best successor for a block, ordinarily we would have preferred
a block that preserves the CFG unless there is a strong probability the other
direction. For small blocks that can be duplicated we now skip that requirement
as well, subject to some simple frequency calculations.
Details
Diff Detail
Event Timeline
Realized that one of the calculations I did was only valid for D28522. Re-worked the calculation for now, and will rebase and update the calculation there.
I like the direction (with more precise cost analysis) this is going. Will review the code soon.
lib/CodeGen/MachineBlockPlacement.cpp | ||
---|---|---|
676 | In this case, what is needed to to invoke 'hasBetterLayoutPredecessor' on PDom block. Dependinng the result, we will know that without tailDup, the layout order is Succ-> PDom or Succ->D->PDom. This will make the cost computation more precise. |
lib/CodeGen/MachineBlockPlacement.cpp | ||
---|---|---|
431 | Suggest new name : isProfitableToTailDup | |
634 | Dom -> PDom | |
641 | Why not just check if there exists a SuccSucc that post dominates Succ directly? | |
660 | PU + PV == P Also Assuming U > V, the layout order (with tail dup) on should be BB --> Succ --> D so the overall cost is: Q + P V + Q ( which is smaller than Q + QV + PU + PV) | |
661 | We can not assume BB and Succ are in a triangular shape subcfg here -- given where this method is called. Besides, if Succ is not tail-duped, the layout decision may even reject Succ as the layout successor, so the cost is no longer P + V, but 2*Q + V instead (with U > V). In other words, isProfitable check can not be done inside 'hasBetterLayoutPredecessor', but hoisted to the caller of it when 'hasBetterLayoutPrecessor' returns, at which point we will know the layout decision if taildup does not kick in. |
Updated the cost calculation to not rely on the lattice layout.
This resulted in fewer duplications in tests, so those tests changes have been rolled into D28522
I made the calculations in terms of frequency instead of probability.
I adjusted the cost calculation when there is a post dominator based on whether it will be laid out after Succ or not.
Let me know if there are any cost calculations that you think are wrong.
Actually upload the diff with what I said was in the last one:
Use frequency instead of probability
Use slight lookahead for more precise probability calculations.
Let me know what you think. There is a small cleanup that could go in as a separate patch: I switched to a SmallDenseSet because we don't need the orderedness of the SmallVectorSet.
lib/CodeGen/MachineBlockPlacement.cpp | ||
---|---|---|
660 | I thought that too. But without the lattice patch, after duplication, we won't put D after Succ because it now has an unplaced predecessor. The lattice patch fixes the behavior and the calculation. | |
661 | I think I have the calculation right for when Succ would not be the layout successor, but you're right to point out that we should also do the calculation even when Succ is the chosen successor. | |
661 | We now only call this function to check if we should use Succ despite it having been rejected. So we know that Succ is not the layout successor. |
lib/CodeGen/MachineBlockPlacement.cpp | ||
---|---|---|
280 | There is a reason SmallSetVector is used here -- to make sure the iteration order is deterministic. | |
653 | I assume this is loop back edge source block. You need a test case to cover it. | |
664 | Why break here? | |
669 | nit: -->SuccBestPred | |
677 | Computing BestSuccPred here is unnecessary. See below for more comments. | |
686 | Qin is not necessarily BestSuccPred. Profitability check is called only after hasBetterLayoutPredecessor is returned and it returns true. There are two scenarios it returns true
Either way, the baseline layout to compare (with taildup) is that BB->Succ is the branch taken edge, and BB->C is the fall through edge. Qin should just be Prob(BB->C) | |
702 | PDom is always a successor of Succ according to the way it is computed. | |
708 | The base cost is as wrote, the DupCost however depends on whether P > Q or not. If P > Q, the fallthrough path is BB->Succ->D 2*Q+ P*V If P < Q, the fall through path is BB->C'->D 2*P + Q*V | |
875 | Add more description about what blocks to ignore. |
I'll be glad to add some more comments to explain, but I think the calculations are correct. I've commented individually.
lib/CodeGen/MachineBlockPlacement.cpp | ||
---|---|---|
280 | BlockFilterSet is never iterated. I checked. | |
661 | I think I have the calculation right for when Succ would not be the layout successor, but you're right to point out that we should also do the calculation even when Succ is the chosen successor. | |
664 | Because if PDom is not null, that's all that we look at for the probability calculation. | |
686 | When we place Succ, we remove 2 fallthrough edges BB->C and C'->Succ. Freq(C'->Succ) may be larger than Freq(BB->C). Qout is Freq(BB->C). I don't think Qin should be as well. | |
702 | Thanks. | |
708 | This function is called in a loop looking for the highest probability successor. If Q > P, this function will be ignored and we will lay out Q anyway, so we can ignore the second case. As to the first case: Instead you will get BB->Succ ; C'->D | |
875 | Well, that's really up to the caller. Do you want me to list why you might want to ignore a block? |
lib/CodeGen/MachineBlockPlacement.cpp | ||
---|---|---|
280 | See for (MachineBasicBlock *LoopBB : LoopBlockSet) fillWorkLists(LoopBB, UpdatedPreds, &LoopBlockSet); | |
686 | differentiate Qin and Qout is fine, but in the code Qin = BestSuccPred which could be Freq(BB->Succ). What I meant is you should directly compute Qin as its definition Freq(C'->Succ) | |
708 | You are right about Q > P case that that scenario will be dropped. It is very subtle, so please add some comment to clarify. Ok -- for the first case, also add a comment | |
875 | something like : e.g, when called under xxx, we want to ignore yyy. See caller zzz for details. However, see my comment in the function, this parameter seems unnecessary. | |
1007 | I think it is equivalent to check Pred == BB. In normal calling context, this is covered by BlockToChain[Pred] == &Chain, but for lookahead case, it is needed to filter BB which is not laid out yet. |
Please mark addressed comments as done. Also let me know if it is ready for another round of review (I saw some issues not addressed such as the deterministic iteration of block filter set).
Marked.
I think it's ready, and I put back the deterministic set.
lib/CodeGen/MachineBlockPlacement.cpp | ||
---|---|---|
686 | Did you still want me to fix something here? |
lib/CodeGen/MachineBlockPlacement.cpp | ||
---|---|---|
653 | test case for this? | |
683 | Add a short cut here with comments: // If P is not larger, the best successor selection loop will eventually select C, not Succ (as it is not profitable to do so). return false; | |
686 | just add a comment above Qin decl stating that Qin is the largest frequency of Succ's incoming edges which have not been placed. | |
1008 | --> ... for lookhead by isProfitableToTailDup when BB has not yet been placed. |
This version looks almost fine except for one remaining unaddressed comment.
lib/CodeGen/MachineBlockPlacement.cpp | ||
---|---|---|
683 | How about this comment? Early return can 1) speed up the computation and 2) make the following code easier to understand. |
Sorry. I'd replied to the comment, but Phabricator didn't submit it along with my diff update for some reason.
Save the blocks with CFG violations that are duplication candidates. Review them in descending order of probability, so we call isProfitableToTailDup the minimum number of times.
lib/CodeGen/MachineBlockPlacement.cpp | ||
---|---|---|
1091 | Remove the first 'SuccProb > BestProb' check -- it provides only very tiny compile time win depending on the iteration order, but adds more confusion. | |
1108 | no need to set ShouldTailDup in the loop -- it is already initalized outside. | |
1116 | Why not just stable sort it? The vector should be of size 1 for most of the cases. Also why do you need position ? | |
1126 | Should it break instead? | |
1128 | isProfitableToTailDup assumes the baseline layout does not pick Succ. The assumption may not be true here as there are other two possibilities:
In such two cases, isProfitable check should probably be skipped (as it is benefitial) |
Changes from comments:
Just sort the vector instead of make_heap.
If there is a tail duplication opportunity and no other successor, take it.
lib/CodeGen/MachineBlockPlacement.cpp | ||
---|---|---|
1116 | Will just sort the vector. Position is because we rely on the successor order being stable and the first successor being a subtle hint. Without the position, we lose track of whether the block in the vector came before or after the block we picked without tail duplication. | |
1128 | Succ won't equal BestSucc.BB because of the continue. These blocks were not chosen by the first loop by construction. Good catch. I'll add that. |
Per offline discussion, I removed the ordering constraint for blocks that are profitable to tail-duplicate.
This resulted in a lot of test churn, but the source change is relatively small.
This looks very clean now.
However the amount of churns remind me of one thing. Since the profit computation is based on static branch prediction (without PGO), it is the right thing to do to be a little more conservative in taildup. In other words, instead of making 'isProtifiable' return true when the taildup cost is smaller than baseline cost, add a predefined margin (controlled by a parameter):
if (baseline_cost - taildup_cost > threshold)
return true;
return false;
The threshold also roughly models the side effect of taildup -- increased icache footprint etc due to code size increase.
Compare frequencies with a small bias against the tail-duplication side to account for increased icache pressure.
Includes a TODO to handle edge frequencies better in general.
lib/CodeGen/MachineBlockPlacement.cpp | ||
---|---|---|
151 | perhaps simplify it to tail-dup-penalty ? | |
620 | This basically treats the penality percent parameter as the threshold of normalized improvement: (A-B)/B if ((A-B)/B > PenaltyPercent/100) return true; The problem with this formula is that if B is very hot, it makes (A-B)/B become small, even though the (A-B) is still large. So I think it is better to compute the normalized improvement as (A-B)/Entry_Freq basically the improvement relative to the entry frequency. This will help prevent tail dup from happening in very cold paths. The implementation can makes use of BranchProbablity as well. Suppose we want to implement condition: if ( (A-B)/Entry_Freq > P/100) return true; do this 3 lines: BlockFrequency Profit = A - B; BlockFrequency Threshold = Entry_Freq * BranchProbability(P, 100); return Profit > Threshold; |
lib/CodeGen/MachineBlockPlacement.cpp | ||
---|---|---|
156 | Is this default value too low? Increase it 5 or 10 perhaps? | |
625 | I suppose this logic here is for rounding errors or overflow? Can you explain why the simple scaling with branch prob (in BranchProbablity.cpp) does not work? return Gain > EntryFreq*ThresholdProb; |
lgtm
(I only sampled some test case changes which look reasonable)
test/CodeGen/X86/bt.ll | ||
---|---|---|
27–32 | This test has not changed in behavior. Better to revert the change. |
test/CodeGen/X86/bt.ll | ||
---|---|---|
27–32 | I'll do a complete check for any tests that fall into this category and revert them. |
perhaps simplify it to tail-dup-penalty ?