This is an archive of the discontinued LLVM Phabricator instance.

Enhance loop rotation with existence of profile data in MachineBlockPlacement pass.
ClosedPublic

Authored by congh on Jun 24 2015, 4:18 PM.

Download Raw Diff

Details

Reviewers

chandlerc
djasper

Commits

rG7745dbc5c419: Enhance loop rotation with existence of profile data in MachineBlockPlacement…
rL250754: Enhance loop rotation with existence of profile data in MachineBlockPlacement…

Summary

Currently, in MachineBlockPlacement pass the loop is rotated to let the best exit to be the last BB in the loop chain, to maximize the fall-through from the loop to outside. With profile data, we can determine the cost in terms of missed fall through opportunities when rotating a loop chain and select the best rotation. Basically, there are three kinds of cost to consider for each rotation:

The possibly missed fall through edge (if it exists) from BB out of the loop to the loop header.
The possibly missed fall through edges (if they exist) from the loop exits to BB out of the loop.
The missed fall through edge (if it exists) from the last BB to the first BB in the loop chain.

Therefore, the cost for a given rotation is the sum of costs listed above. We select the best rotation with the smallest cost. This is only for PGO mode when we have more precise edge frequencies.

Diff Detail

Repository: rL LLVM

Event Timeline

congh updated this revision to Diff 28417.Jun 24 2015, 4:18 PM

congh retitled this revision from to Enhance loop rotation with existence of profile data in MachineBlockPlacement pass..

congh updated this object.

congh edited the test plan for this revision. (Show Details)

congh added a reviewer: chandlerc.

congh added a subscriber: Unknown Object (MLST).

Will try to get to this patch soon, but if you haven't yet, you should
start working on some reduced test cases similar to the existing ones to
use in conjunction with this mode.

I have added a test case for this patch, which checks whether the loop will be rotated if there is profile data for the function so that the hot back edge will become fall-through.

Bug fix for checking if a function has profile data (from using hasMetadata() to getEntryCount()).

Update the patch to consider the cost of unconditional jump from outside of the loop to the loop header.

davidxl added a subscriber: davidxl.Jun 26 2015, 2:18 PM

davidxl added inline comments.

lib/CodeGen/MachineBlockPlacement.cpp
250 ↗	(On Diff #28490)	Why changing this interface at all? There does not seem to be necessary.
761 ↗	(On Diff #28490)	ExistingBB is already computed in the caller, there is no need for the interface change here ..
797 ↗	(On Diff #28490)	Though equivalent, but it might be more readable if it computes 'benefit/savings' in terms of the fall through opportunities instead of the cost. For instance, in the following three items, 1 and 2 are really savings that come with the rotation, and 3 is the saving that will be lost with the candidate layout (including the original layout before the rotation).
818 ↗	(On Diff #28490)	Why is this loop outside of the following loop? I expect each new candidate loop header to have a chance to check the fall through opportunity and related saving/cost.
821 ↗	(On Diff #28490)	If the previous chain does not end up connecting with this new top block in the current loop chain, then the benefit of the fall through won't be materializable, for instance when the Pred has more than 1 successors and the other successor is also a chain lead.
828 ↗	(On Diff #28490)	I think the adjustment should be done the other way: when num of successors > 1, the cost will be adjust to the minimal freq of the Pred's outgoing edge -- that is where the unconditional jump will be inserted (assuming the fall through path is the hottest one)
846 ↗	(On Diff #28490)	See the previous comment. Even when Iter is not the current Head, it may still have fall through opportunity.
855 ↗	(On Diff #28490)	Do you also want to check if the Succ block has more than one incoming edges. If it has more than one incoming edges, need to make sure 'Succ' block's other predecessor is not a tail of another chain that can be connnected with 'Succ'.
870 ↗	(On Diff #28490)	This needs more explanation.
939 ↗	(On Diff #28490)	I suggest adding another intern option to guard this. if (UseProfileBasedCostAnalysis && F.getFunction()->getEntryCount()) ...
941 ↗	(On Diff #28490)	the check does not seem necessary. If LoopTop == header, ExitingBB will be null.

congh added inline comments.Jun 26 2015, 4:08 PM

lib/CodeGen/MachineBlockPlacement.cpp
250 ↗	(On Diff #28490)	This is only an internal "interface". The reason why I moved the computation of ExitingBB to inside of this function is because ExitingBB is only needed by rotateLoop and it doesn't make sense to compute it outside of this function. This makes the logic more compact and clear: it is rotateLoop but not buildLoopChains who needs ExitingBB.
250 ↗	(On Diff #28490)	I just checked this part again, and I found I was wrong: the exiting BB is calculated before building the loop chain. This should be guaranteed as when finding best exit BB it is assumed that BBs haven't been merged to the loop chain. I will update the patch by rolling this part back.
761 ↗	(On Diff #28490)	Please see my comment above.
797 ↗	(On Diff #28490)	Agree. I will update the comment.
818 ↗	(On Diff #28490)	Here we only consider natural loops, which have only one entry (loop header). As there is only one loop header, we only need to compute the fall-through once. We only check this benefit if the loop top is the loop header.
821 ↗	(On Diff #28490)	That is right. But at this point we haven't started to build chains outside of this loop. It is difficult to tell if Pred will or will not be connected to the loop header. This is just a heuristic-like optimization. We could not guarantee global optimal result.
828 ↗	(On Diff #28490)	Consider a two-way branch from Pred with frequencies x and y, and x is the frequency from Pred to the loop header. When x > y, and the edge is not fall-thru, then the number of jumps is x + y + y, and branch cost is x + y; if it is a fall-thru, the number of jumps is x + y, and the branch cost is y. If we take the weight of branch cost and jump instruction to be the same, then we can save x+y by making that edge fall-thru. If x < y, the edge from Pred to the loop header won't be fall-thru and we save nothing. So the conclusion is: if num of successors > 1 and the edge from Pred to the loop header is the most frequent one, then making this edge fall-thru can save us the freq of Pred. If that edge is not the most frequent one, we could save nothing. If num of successors == 1, we can save 2x freq of Pred.
846 ↗	(On Diff #28490)	The fall-thru from Pred to another successor that is not the header? Then loop rotation won't affect that fall-thru.
855 ↗	(On Diff #28490)	The function findBestLoopExit() doesn't consider this possibility. If we check this, should we go ahead and check if the source of the more frequent incoming edge has a more frequent outgoing edge... I am afraid this may make things too complicated.
939 ↗	(On Diff #28490)	OK.
941 ↗	(On Diff #28490)	This is because I changed the position of findBestLoopExit(). As I will resume its position, I will also update this part.

The updated patch according to David's comments.

davidxl added inline comments.Jun 26 2015, 5:11 PM

lib/CodeGen/MachineBlockPlacement.cpp
818 ↗	(On Diff #28490)	Ok -- a comment there about that will be useful.
821 ↗	(On Diff #28490)	Is it worthwhile considering doing loop rotation after all chains have been formed?
828 ↗	(On Diff #28490)	case 1: num_successors (Pred) == 2, and x > y if Pred->Header edge is not the fall through, assuming the other edge will become a fall through edge , then the costs will be: number of jumps executed: (x+y), cost of jump is C1 number of taken branches: x, cost of branch taken is C2: Total cost: C1* (x+y) + C2* x if Pred->Header edge is the fall through, number of jumps executed: (x+y) number of taken branches: y Total cost: C1 * (x+y) + C2 * y so the saving (of keeping the Pred->header fall through) is C2 * (x - y) case 2: number of successors (Pred) == 2, x < y The saving of keeping Pred->header to be fall through is still C2 * (x - y ), but negative. case 3: number of successors (Pred) == 1 Keeping the fall through, the total cost is 0, while not keeping the fall through the cost is (C1+C2)x Conclusion a) when number of successors > 1, the real cost of not doing the fall through should be adjusted to C2 (x-y) b) when the number of successors == 1, your estimate is correct, assuming C1 == C2 c) better introduce a cost parameter for type1 and type2
846 ↗	(On Diff #28490)	This is explained by you -- only natural loop is considered.
855 ↗	(On Diff #28490)	you need keep an eye on it and consider not doing loop rotation on the fly and chain merging on the fly. The decision of rotation and how chains are merged/connected need to be made together.
941 ↗	(On Diff #28490)	Another question is: if the current layout algorithm finds a different 'best' loop top than the header, than the profile driven/cost based rotation will be suppressed, which is not desirable.

Cong, the proposed cost analysis can not handle diamond shaped or loops with more complicated control flow. To handle those cases, I think you need to generalize it by computing the complete branch costs for all rotation candidates and select the one with the least cost. The entry and exit connection cost computation remains the same. To compute taken branch cost, there are only 4 cases to consider: tail BB with 1 successor, tail BB multiple successors, non-tail BB with 1 successor, non-tail BB with multiple successors.

B0
do {
B1
if (..) {

B2

} else {

B3

}
B4
} while (..);
B5;

If the original layout keeps the topo order:
B0 B1 B2 B3 B4 B5

The cost analysis should discover the optimal rotation to be
B0 B3 B4 B1 B2 B5

congh added inline comments.Jun 29 2015, 10:53 AM

lib/CodeGen/MachineBlockPlacement.cpp
818 ↗	(On Diff #28490)	OK. Thanks!
821 ↗	(On Diff #28490)	If all chains are already formed, loop rotation may not be done as BBs outside of the loop are already merged into chains. For example, BBs just after loops' exits may not be chain tops. I thinks LLVM's bottom-up method makes sense here.
828 ↗	(On Diff #28490)	Note that here we focus on loop rotation. In the case of x > y, the other edge with frequency y won't become a fall through edge. If the header is not the chain top, then Pred will chose another loop BB other than the header or the other successor outside of the loop as its successor in the layout. That is why the number of jumps executed is x + y + y (in this case there both of its CFG successors are not fall-thru so there are two jumps). Following this analysis (as in my last post), we could see that we can save (C1 + C2) * y by making the header as the top of the loop chain.
855 ↗	(On Diff #28490)	In LLVM' current bottom-up algorithm, those two processes are separated by design. Even we delay the loop rotation until forming chains outside of it, if it has several exits, the situation is still complicated (single-entry single-exit loops are much easier to handle as they can be reduced to one CFG node). The loop rotation algorithm is based on the heuristic that the loop exit at the chain bottom are much likely to have a fall-thru edge to the outside.
870 ↗	(On Diff #28490)	Done.
941 ↗	(On Diff #28490)	Actually, if we use the profile based loop rotation algorithm, we could discard the current method to find the best loop top and exit. This is because our algorithm already considers them.

In D10717#196064, @davidxl wrote:
Cong, the proposed cost analysis can not handle diamond shaped or loops with more complicated control flow. To handle those cases, I think you need to generalize it by computing the complete branch costs for all rotation candidates and select the one with the least cost. The entry and exit connection cost computation remains the same. To compute taken branch cost, there are only 4 cases to consider: tail BB with 1 successor, tail BB multiple successors, non-tail BB with 1 successor, non-tail BB with multiple successors.

B0
do {
B1
if (..) {
B2
} else {
B3
}
B4
} while (..);
B5;

If the original layout keeps the topo order:
B0 B1 B2 B3 B4 B5

The cost analysis should discover the optimal rotation to be
B0 B3 B4 B1 B2 B5

As we know, the current LLVM's algorithm always form a chain with all BBs in a loop, which is a drawback. I will propose another patch that changes this behavior. In the new algorithm, the loop chain formation may stop as long as an exit is reached, and only this loop chain will be rotated. This is similar to GCC's strategy. In the above example, once B1->B3->B4 is formed, the loop chain build is complete. Then B1 is the best candidate to be the chain bottom as there is an fall-thru edge from B1 to B2.

Add one more test case with a diamond structure in the loop.

Update the patch with the new cost analysis of the tail to head in the loop chain.

congh updated this revision to Diff 28815.Jun 30 2015, 2:29 PM

Update the patch by adding two opt parameters that define the cost of misfetch and jump instruction, and use them when rotating loops.

Added one more test case with a nested loop.

Chandler, could you please take a look at this patch and see if it is ready to commit?

Sorry for delays getting to this patch.

lib/CodeGen/MachineBlockPlacement.cpp
84–86 ↗	(On Diff #29482)	This seems confusing to me. We use profile data to guide many parts of block placement. This flag doesn't guard most of them. Could you select a more accurate name?
89–92 ↗	(On Diff #29482)	It might be useful to clarify that this cost models the probabalistic risk of an instruction misfetch due to a jump relative to falling through?
94–96 ↗	(On Diff #29482)	We have lots of existing measures of the cost of an instruction already. It would probably help to clarify this is a weight specifically designed to fit with the scale of the misfetch cost above? Not really sure how important it is to be able to tune these two costs independently.
269–271 ↗	(On Diff #29482)	My assumption is that the flag is expected to go away at some point. As such, I wouldn't embed a predicate here. I would sink this into the code that actually needs to test these properties to make decisions.
815–826 ↗	(On Diff #29482)	"we could determine" -> "we can determine" Also, I find the mixture of "benefit" and "cost" confusing in this comment. I would try to use more clear terms here.
831 ↗	(On Diff #29482)	Please don't use floating point to compute these kinds of things. We would like LLVM to not depend no the vagaries of the FPU and would like to avoid any possible numerical instability manifesting as non-deterministic compiles. Instead, you should compute things using the BlockFrequency and BranchProbability abstract units which do the fixed-point math for you.
846–849 ↗	(On Diff #29482)	Why is a conditional jump's benefit not considered?
854–856 ↗	(On Diff #29482)	There is no comment explaining the actual algorithm at this point. Please add one. Especially with a loop structure this complex (three iterators??) it would help to give the reader some context for how this loop is supposed to operate.
868 ↗	(On Diff #29482)	This seems to be a cost rather than a benefit?
877–879 ↗	(On Diff #29482)	You should compute the max edge probability, and factor in the frequency outside of the loop.

This revision now requires changes to proceed.Jul 24 2015, 11:21 AM

davidxl added inline comments.Jul 24 2015, 1:46 PM

lib/CodeGen/MachineBlockPlacement.cpp
84–86 ↗	(On Diff #29482)	yes -- something like 'precise-rotation-cost' would be clearer.

congh marked 9 inline comments as done.Jul 24 2015, 3:43 PM

congh added inline comments.

lib/CodeGen/MachineBlockPlacement.cpp
84–86 ↗	(On Diff #29482)	I have chosen the name suggested by David.
89–92 ↗	(On Diff #29482)	I have updated the description following your suggestion.
94–96 ↗	(On Diff #29482)	OK. I have updated the description. We actually only need one ratio between those two costs. But I think it would be more clear to give two costs separately in case we need other costs in the future (like mispredicate?).
269–271 ↗	(On Diff #29482)	OK. Now this is only called once and it is OK to "inline" this function there.
815–826 ↗	(On Diff #29482)	You are right. I have updated the comment by clearly stating which item is a benefit and which one is a cost.
831 ↗	(On Diff #29482)	Thanks for explaining the background! I have used a pair of BlockFrequency to express benefit and cost respectively.
846–849 ↗	(On Diff #29482)	You mean when Pred->succ_size() > 1? In that case we assume Pred will have the other successor as the layout successor, which is very likely to happen, and then we will save nothing (there are some cases that we can guess at this point that Pred won't have the other successor as fall-through. We could do this optimization here but I think it makes the analysis too complicated).
854–856 ↗	(On Diff #29482)	OK. I have added a comment explains what this loop does. The TailIter is also described just below this statement.
868 ↗	(On Diff #29482)	This is a benefit because then we have a fallthrough from a block outside of the loop to the loop header. I think my comment here is incorrect and sorry about it (it is corrected now).
877–879 ↗	(On Diff #29482)	Yes, this is more efficient. Done.

Chandler, I have updated the patch according to your comments. PTAL. Thanks!

chandlerc added inline comments.Jul 24 2015, 7:21 PM

lib/CodeGen/MachineBlockPlacement.cpp
816–827 ↗	(On Diff #30619)	I think it would be clearer to talk only in terms of cost. Below, when computing it, you can add and subtract costs to find the minimum. Consider wording along the lines of: There are three costs we want to minimize stemming from a jump rather than a fall through edge: Entering the loop, Exiting the loop, and Continuing to loop by returning to the header. Does that make sense? If not, then maybe I'm misunderstanding what you're trying to say more fundamentally.
834–838 ↗	(On Diff #30619)	As above, I think it would be useful to have just cost, and to use the natural addition and subtraction to find the minimum cost rotation. Also, "hopefully there is no overflow" seems a bad sign. We should definitely have no overflow, and I think it makes sense to add asserts and such to enforce that.
902 ↗	(On Diff #30619)	Tragically, this loop is still quadratic in complexity. getEdgeProbability is really inefficient when queried in this way. You can find other code in this file to compute the best successor probability without hitting this quadratic behavior. We should really figure out an API for this in BPI but for now, I would suggest extracting a helper and using it here.
912–916 ↗	(On Diff #30619)	What if one of the successors is outside the loop? (Also, please add a test case covering this?)
846–849 ↗	(On Diff #29482)	I'm curious what data makes you confident that the other successor will end up the layout successor? It seems strange to me to assume that because there are more than one successors of the non-loop-predecessor of the header, the loop header won't be the layout-successor. Naively, I would expect a common pattern of: E -- \| \| \| v \| LH <- \| \| \| \| v \| \| L2 -- \| \| v \| X <- Where LH is the loop header, and E is the entering block and X is the exit block. With this, it is perfectly conceivable that LH would be the layout successor of E.

davidxl added inline comments.Jul 24 2015, 9:03 PM

lib/CodeGen/MachineBlockPlacement.cpp
816–827 ↗	(On Diff #30619)	More precisely, the objective is to find a rotation with minimal overall cost or equivalently maximal the overall benefit. The cost/benefit function has 3 components.
834–838 ↗	(On Diff #30619)	Cong is actually computing the overall benefit and maximize it. Cong, It is equivalent to compute cost only -- simply change it to LoopExternalConnectionCost. For rotation with positive benefit in current patch, the connection cost will be 0, while for rotation with no benefit, the connection cost will be the branch freq to the head (or branch out freq)
847–850 ↗	(On Diff #30619)	I believe I raised similar question before (see comment and replies at line 818). I think the problem is that we can not be very sure if this loop's header will or will not be the layout successor of E due to the loop based layout ordering constraint (In fact in your example, unless the branch is highly biased towards LH, it is likely X will be the layout successor of E). That is why I proposed at some point that loop rotation be done when base layout is done for the function -- however that may require too big a change. It might worth considering obvious beneficial conditional jump here.
902 ↗	(On Diff #30619)	yes -- can consider computing TailBB's sum weight outside the loop.
912–916 ↗	(On Diff #30619)	this should not matter.
915 ↗	(On Diff #30619)	I don't think this formula is correct. The correct formula is (x-x')MisfetchCost + yJumpInstrCost Where x' is the frequency of the edge from the tail node to the block which is not its layout successor (before the rotation).

congh added inline comments.Jul 26 2015, 11:24 PM

lib/CodeGen/MachineBlockPlacement.cpp
816–827 ↗	(On Diff #30619)	Yes. This makes more sense. But from the implementation aspect, BlockFrequency doesn't provide subtract operation. This is why I use both benefit and cost. I could scale down all frequencies so that I can use signed 64bit integer to calculate the cost or benefit.
834–838 ↗	(On Diff #30619)	The arithmetic operation defined in BlockFrequency already considers overflow, but it may return the saturated result, which may not be what we want. I may have to use the similar way that is used in getSumForBlock() to prevent potential overflow.
834–838 ↗	(On Diff #30619)	David, I am not clear how LoopExternalConnectionCost is working: do we only need this cost, or do we have other cost/benefit? I ask this because there may exist more than one rotation that is beneficial.
847–850 ↗	(On Diff #30619)	Without knowing the layout outside of the loop, we do not know if the loop header will or will not be the layout-successor of its predecessor outside of the loop. However, this is the case when we do loop rotation (unless we change the design as suggested by David). But if a loop has many iterations, the benefit of the loop header's fall-through would be trivial comparing to the cost from the loop chain's tail to its head (similar assumption for the exit fall-through). If not, the fall-through to the loop header is very important to the overall cost and we'd better hope that the fall-through will be there in the final layout. That is why I think that we should always take this fall-through into consideration.
902 ↗	(On Diff #30619)	Yes, thanks for pointing it out. I will calculate the sum of weight outside of the loop to avoid redundant calculations. I will consider to design an API in BPI to help us here.
912–916 ↗	(On Diff #30619)	That would be an exit and the exit fall-through is also considered here. I will add a test case about it (maybe a loop with several exits with different frequencies).
915 ↗	(On Diff #30619)	We don't have to compare each rotation with the current layout before rotation. We just calculate its own cost. Your formula is equivalent to the current one: you subtract x' * MisfetchCost from every rotation (so the layout before rotation is x' - x' = 0 here).

davidxl added inline comments.Jul 27 2015, 12:53 AM

lib/CodeGen/MachineBlockPlacement.cpp
834–838 ↗	(On Diff #30619)	For each loop rotation, the entry-connection cost will be the sum of frequencies (multiplied by misfetch cost) of all incoming edges that are not 'fall through' edges. Similarly, the exit-connection cost can be computed.
915 ↗	(On Diff #30619)	What is needed is the cost introduced due to the rotation, however my version of the cost function is indeed incorrect. We can actually prove it is the following: (xMisFetchCost + min(x,y)JumpInstCost)TailNodeFreq Note that this is slightly different from yours -- x is not the larger of the probabilities, but the original fall through probability. The second term is the same. Suppose before the rotation, the layout is ... BB1,... TailBB, BB2 where TailBB is ended with 'jcc BB1'. The fall through (not taken) prob is x, and taken probability is y. After rotation, the layout is BB2, ... BB1, .. TailBB. There are two cases: x > y. The jump instructions added to TailBB will be jcc' BB2 followed by direct jump: jmp BB1 Rotation cost will be: xMisFetchCost + yMisFetchCost + y JumpInstCost - yMisFetchCost = xMisFetchCost + y * JumpInstCost x < y, The jump instructions added will be jcc BB1 + jmp BB2 rotation cost is: x * MisFetchCost + x*JumpInstCost Combine 1) and 2),

congh added inline comments.Jul 27 2015, 1:48 PM

lib/CodeGen/MachineBlockPlacement.cpp
834–838 ↗	(On Diff #30619)	I just understand how to define LoopExternalConnectionCost: a loop rotation with the chain head that is not the loop header will have this cost, which actually is equal to the benefit I am now using. But I don't agree with your latest definition of it: we only need to consider the frequency of the potential fall-through edge from the header's predecessor to the header.
915 ↗	(On Diff #30619)	Yes, your analysis is correct. Originally I assume x >= y as I assume when BB2 is chosen to be the layout-successor of TailBB during loop chain build, it has larger edge frequency than other edges. However, I found this assumption may not hold (considering critical edge cases). I will update the code following your analysis. Thanks a lot!

congh added inline comments.Jul 27 2015, 2:06 PM

lib/CodeGen/MachineBlockPlacement.cpp
834–838 ↗	(On Diff #30619)	Also, I found it difficult to calculate the exit-connection cost: there may exist several exits in the loop, and at first we don't know how much cost should be given to each rotation. Even we find the largest exit frequency from another loop, we could not avoid subtract operation on costs (represented by BlockFrequency without this operation defined) when a rotation has a tail block with an exit edge that is less frequent. I am now afraid that in order to unify cost and benefit we may pay more here.

Measure the loop rotation only in cost instead of cost+benefit.

Improve the inefficient code that computes branch probabilities.

Ping on this patch?

Update the patch by skipping getting best loop top and exit when profile data is available.

PTAL.

Ping?

congh added a reviewer: djasper.Sep 10 2015, 1:18 PM

Just a minor comment about the wording of the options. Otherwise, I don't think I'll add much to the reviews by both Chandler and David.

lib/CodeGen/MachineBlockPlacement.cpp
93 ↗	(On Diff #31424)	In what way is the cost "relative"? In my understanding, this can either mean that it is a difference or a ratio. In either case "unsigned" would be a weird data type.
99 ↗	(On Diff #31424)	Same as above, what do you actually mean by "relative"?

Thanks for reviewing those patches, Daniel! (Sorry for the delayed reply as I just came back from vacation).

lib/CodeGen/MachineBlockPlacement.cpp
93 ↗	(On Diff #31424)	Here we assume that fall-through will have zero cost comparing to 1 for non-fall-through. As this is not a "relative" cost, I have updated this description to: Cost that models the probabilistic risk of an instruction misfetch due to a jump comparing to falling through, whose cost is zero.
99 ↗	(On Diff #31424)	This is a relative cost: if the cost of a misfetch is 1, then we assume the cost of a jump instruction is also 1 (we can adjust those two options to change their ratio if necessary).

Update the patch according to Daniel's comment.

davidxl added inline comments.Oct 19 2015, 2:26 PM

lib/CodeGen/MachineBlockPlacement.cpp
92 ↗	(On Diff #37153)	Fix typo: probabilistic.
99 ↗	(On Diff #37153)	From the way this parameter is used below, it does not look like a relative cost, but more absolute cost.
842 ↗	(On Diff #37153)	Why not just multiply with Scale?
855 ↗	(On Diff #37153)	Why is the result called 'Freq' -- it should really be the cost.
893 ↗	(On Diff #37153)	Fix indentation.
922 ↗	(On Diff #37153)	Fix the comment. It is not 'we still get an additional ..', but 'we may still get an additional jmp if the exit bb does not have its target as the layout successor'.
925 ↗	(On Diff #37153)	chair --> chain
938 ↗	(On Diff #37153)	does this not consider the case when a direct jump is not needed?

congh marked 3 inline comments as done.Oct 19 2015, 3:15 PM

congh added inline comments.

lib/CodeGen/MachineBlockPlacement.cpp
99 ↗	(On Diff #37153)	OK. I will remove the word "relative" here.
842 ↗	(On Diff #37153)	There is no operator * defined between a BlockFrequency and an integer. Also, this / operator is defined to be saturating. I will add a comment for explanation.
855 ↗	(On Diff #37153)	I think it is due to some historic reason. Fixed.
893 ↗	(On Diff #37153)	This is the result from running clang-format.
922 ↗	(On Diff #37153)	Note that we are sure that the two BBs involved here were layout together before, and the second one is a successor of the first one. So we are sure that there will be an additional jump.
938 ↗	(On Diff #37153)	This should not happen. For this rotation, the TailBB is the tail and its layout successor must be a BB that is outside of the loop chain.

Update the patch according to David's comment.

davidxl added inline comments.Oct 19 2015, 3:33 PM

lib/CodeGen/MachineBlockPlacement.cpp
920 ↗	(On Diff #37803)	The situation is TailBB->currentChainHead TailBB->OtherSucessor After rotation, TailBB->OtherSuccessor may be a fall through, so there is no need for a jump for that case. If OtherSuccessor is inside the chain, then a jmp is definitely needed.

congh added inline comments.Oct 19 2015, 3:45 PM

lib/CodeGen/MachineBlockPlacement.cpp
920 ↗	(On Diff #37803)	OK. This makes sense. As we don't know the layout successor of the loop chain, we only need to update this comment.

Update the patch by fixing a comment.

This version LGTM as it is not enabled by default yet. There are a couple of followups to do:

collect extensively performance data (including PMU perf data), and possibly tune the parameters (for selected targets)
turn it on by default when tuning is done

think about how to better handle loop rotation cost analysis with the right ordering --- current way (from inside out) is less than ideal, as the compiler has to guess how header/exit blocks are connected with blocks in the enclosing chain (it has not been laid out yet).

congh updated this object.Oct 19 2015, 4:17 PM

Closed by commit rL250754: Enhance loop rotation with existence of profile data in MachineBlockPlacement… (authored by conghou). · Explain WhyOct 19 2015, 4:18 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path

Size

llvm/

trunk/

lib/

CodeGen/

MachineBlockPlacement.cpp

187 lines

test/

CodeGen/

X86/

code_placement_loop_rotation.ll

80 lines

code_placement_loop_rotation2.ll

122 lines

Diff 37815

llvm/trunk/lib/CodeGen/MachineBlockPlacement.cpp

Show First 20 Lines • Show All 75 Lines • ▼ Show 20 Lines	static cl::opt<bool> OutlineOptionalBranches(
cl::init(false), cl::Hidden);		cl::init(false), cl::Hidden);

static cl::opt<unsigned> OutlineOptionalThreshold(		static cl::opt<unsigned> OutlineOptionalThreshold(
"outline-optional-threshold",		"outline-optional-threshold",
cl::desc("Don't outline optional branches that are a single block with an "		cl::desc("Don't outline optional branches that are a single block with an "
"instruction count below this threshold"),		"instruction count below this threshold"),
cl::init(4), cl::Hidden);		cl::init(4), cl::Hidden);

		static cl::opt<bool>
		PreciseRotationCost("precise-rotation-cost",
		cl::desc("Model the cost of loop rotation more "
		"precisely by using profile data."),
		cl::init(false), cl::Hidden);

		static cl::opt<unsigned> MisfetchCost(
		"misfetch-cost",
		cl::desc("Cost that models the probablistic risk of an instruction "
		"misfetch due to a jump comparing to falling through, whose cost "
		"is zero."),
		cl::init(1), cl::Hidden);

		static cl::opt<unsigned> JumpInstCost("jump-inst-cost",
		cl::desc("Cost of jump instructions."),
		cl::init(1), cl::Hidden);

namespace {		namespace {
class BlockChain;		class BlockChain;
/// \brief Type for our function-wide basic block -> block chain mapping.		/// \brief Type for our function-wide basic block -> block chain mapping.
typedef DenseMap<MachineBasicBlock , BlockChain > BlockToChainMapType;		typedef DenseMap<MachineBasicBlock , BlockChain > BlockToChainMapType;
}		}

namespace {		namespace {
/// \brief A chain of blocks which will be laid out contiguously.		/// \brief A chain of blocks which will be laid out contiguously.
▲ Show 20 Lines • Show All 152 Lines • ▼ Show 20 Lines	void buildChain(MachineBasicBlock *BB, BlockChain &Chain,
const BlockFilterSet *BlockFilter = nullptr);		const BlockFilterSet *BlockFilter = nullptr);
MachineBasicBlock *findBestLoopTop(MachineLoop &L,		MachineBasicBlock *findBestLoopTop(MachineLoop &L,
const BlockFilterSet &LoopBlockSet);		const BlockFilterSet &LoopBlockSet);
MachineBasicBlock *findBestLoopExit(MachineFunction &F, MachineLoop &L,		MachineBasicBlock *findBestLoopExit(MachineFunction &F, MachineLoop &L,
const BlockFilterSet &LoopBlockSet);		const BlockFilterSet &LoopBlockSet);
void buildLoopChains(MachineFunction &F, MachineLoop &L);		void buildLoopChains(MachineFunction &F, MachineLoop &L);
void rotateLoop(BlockChain &LoopChain, MachineBasicBlock *ExitingBB,		void rotateLoop(BlockChain &LoopChain, MachineBasicBlock *ExitingBB,
const BlockFilterSet &LoopBlockSet);		const BlockFilterSet &LoopBlockSet);
		void rotateLoopWithProfile(BlockChain &LoopChain, MachineLoop &L,
		const BlockFilterSet &LoopBlockSet);
void buildCFGChains(MachineFunction &F);		void buildCFGChains(MachineFunction &F);

public:		public:
static char ID; // Pass identification, replacement for typeid		static char ID; // Pass identification, replacement for typeid
MachineBlockPlacement() : MachineFunctionPass(ID) {		MachineBlockPlacement() : MachineFunctionPass(ID) {
initializeMachineBlockPlacementPass(*PassRegistry::getPassRegistry());		initializeMachineBlockPlacementPass(*PassRegistry::getPassRegistry());
}		}

▲ Show 20 Lines • Show All 526 Lines • ▼ Show 20 Lines	void MachineBlockPlacement::rotateLoop(BlockChain &LoopChain,
BlockChain::iterator ExitIt =		BlockChain::iterator ExitIt =
std::find(LoopChain.begin(), LoopChain.end(), ExitingBB);		std::find(LoopChain.begin(), LoopChain.end(), ExitingBB);
if (ExitIt == LoopChain.end())		if (ExitIt == LoopChain.end())
return;		return;

std::rotate(LoopChain.begin(), std::next(ExitIt), LoopChain.end());		std::rotate(LoopChain.begin(), std::next(ExitIt), LoopChain.end());
}		}

		/// \brief Attempt to rotate a loop based on profile data to reduce branch cost.
		///
		/// With profile data, we can determine the cost in terms of missed fall through
		/// opportunities when rotating a loop chain and select the best rotation.
		/// Basically, there are three kinds of cost to consider for each rotation:
		/// 1. The possibly missed fall through edge (if it exists) from BB out of
		/// the loop to the loop header.
		/// 2. The possibly missed fall through edges (if they exist) from the loop
		/// exits to BB out of the loop.
		/// 3. The missed fall through edge (if it exists) from the last BB to the
		/// first BB in the loop chain.
		/// Therefore, the cost for a given rotation is the sum of costs listed above.
		/// We select the best rotation with the smallest cost.
		void MachineBlockPlacement::rotateLoopWithProfile(
		BlockChain &LoopChain, MachineLoop &L, const BlockFilterSet &LoopBlockSet) {
		auto HeaderBB = L.getHeader();
		auto HeaderIter = std::find(LoopChain.begin(), LoopChain.end(), HeaderBB);
		auto RotationPos = LoopChain.end();

		BlockFrequency SmallestRotationCost = BlockFrequency::getMaxFrequency();

		// A utility lambda that scales up a block frequency by dividing it by a
		// branch probability which is the reciprocal of the scale.
		auto ScaleBlockFrequency = [](BlockFrequency Freq,
		unsigned Scale) -> BlockFrequency {
		if (Scale == 0)
		return 0;
		// Use operator / between BlockFrequency and BranchProbability to implement
		// saturating multiplication.
		return Freq / BranchProbability(1, Scale);
		};

		// Compute the cost of the missed fall-through edge to the loop header if the
		// chain head is not the loop header. As we only consider natural loops with
		// single header, this computation can be done only once.
		BlockFrequency HeaderFallThroughCost(0);
		for (auto *Pred : HeaderBB->predecessors()) {
		BlockChain *PredChain = BlockToChain[Pred];
		if (!LoopBlockSet.count(Pred) &&
		(!PredChain \|\| Pred == *std::prev(PredChain->end()))) {
		auto EdgeFreq =
		MBFI->getBlockFreq(Pred) * MBPI->getEdgeProbability(Pred, HeaderBB);
		auto FallThruCost = ScaleBlockFrequency(EdgeFreq, MisfetchCost);
		// If the predecessor has only an unconditional jump to the header, we
		// need to consider the cost of this jump.
		if (Pred->succ_size() == 1)
		FallThruCost += ScaleBlockFrequency(EdgeFreq, JumpInstCost);
		HeaderFallThroughCost = std::max(HeaderFallThroughCost, FallThruCost);
		}
		}

		// Here we collect all exit blocks in the loop, and for each exit we find out
		// its hottest exit edge. For each loop rotation, we define the loop exit cost
		// as the sum of frequencies of exit edges we collect here, excluding the exit
		// edge from the tail of the loop chain.
		SmallVector<std::pair<MachineBasicBlock *, BlockFrequency>, 4> ExitsWithFreq;
		for (auto BB : LoopChain) {
		uint32_t LargestExitEdgeWeight = 0;
		for (auto *Succ : BB->successors()) {
		BlockChain *SuccChain = BlockToChain[Succ];
		if (!LoopBlockSet.count(Succ) &&
		(!SuccChain \|\| Succ == *SuccChain->begin())) {
		uint32_t SuccWeight = MBPI->getEdgeWeight(BB, Succ);
		LargestExitEdgeWeight = std::max(LargestExitEdgeWeight, SuccWeight);
		}
		}
		if (LargestExitEdgeWeight > 0) {
		uint32_t WeightScale = 0;
		uint32_t SumWeight = MBPI->getSumForBlock(BB, WeightScale);
		auto ExitFreq =
		MBFI->getBlockFreq(BB) *
		BranchProbability(LargestExitEdgeWeight / WeightScale, SumWeight);
		ExitsWithFreq.emplace_back(BB, ExitFreq);
		}
		}

		// In this loop we iterate every block in the loop chain and calculate the
		// cost assuming the block is the head of the loop chain. When the loop ends,
		// we should have found the best candidate as the loop chain's head.
		for (auto Iter = LoopChain.begin(), TailIter = std::prev(LoopChain.end()),
		EndIter = LoopChain.end();
		Iter != EndIter; Iter++, TailIter++) {
		// TailIter is used to track the tail of the loop chain if the block we are
		// checking (pointed by Iter) is the head of the chain.
		if (TailIter == LoopChain.end())
		TailIter = LoopChain.begin();

		auto TailBB = *TailIter;

		// Calculate the cost by putting this BB to the top.
		BlockFrequency Cost = 0;

		// If the current BB is the loop header, we need to take into account the
		// cost of the missed fall through edge from outside of the loop to the
		// header.
		if (Iter != HeaderIter)
		Cost += HeaderFallThroughCost;

		// Collect the loop exit cost by summing up frequencies of all exit edges
		// except the one from the chain tail.
		for (auto &ExitWithFreq : ExitsWithFreq)
		if (TailBB != ExitWithFreq.first)
		Cost += ExitWithFreq.second;

		// The cost of breaking the once fall-through edge from the tail to the top
		// of the loop chain. Here we need to consider three cases:
		// 1. If the tail node has only one successor, then we will get an
		// additional jmp instruction. So the cost here is (MisfetchCost +
		// JumpInstCost) * tail node frequency.
		// 2. If the tail node has two successors, then we may still get an
		// additional jmp instruction if the layout successor after the loop
		// chain is not its CFG successor. Note that the more frequently executed
		// jmp instruction will be put ahead of the other one. Assume the
		// frequency of those two branches are x and y, where x is the frequency
		// of the edge to the chain head, then the cost will be
		// (x * MisfetechCost + min(x, y) * JumpInstCost) * tail node frequency.
		// 3. If the tail node has more than two successors (this rarely happens),
		// we won't consider any additional cost.
		if (TailBB->isSuccessor(*Iter)) {
		auto TailBBFreq = MBFI->getBlockFreq(TailBB);
		if (TailBB->succ_size() == 1)
		Cost += ScaleBlockFrequency(TailBBFreq.getFrequency(),
		MisfetchCost + JumpInstCost);
		else if (TailBB->succ_size() == 2) {
		auto TailToHeadProb = MBPI->getEdgeProbability(TailBB, *Iter);
		auto TailToHeadFreq = TailBBFreq * TailToHeadProb;
		auto ColderEdgeFreq = TailToHeadProb > BranchProbability(1, 2)
		? TailBBFreq * TailToHeadProb.getCompl()
		: TailToHeadFreq;
		Cost += ScaleBlockFrequency(TailToHeadFreq, MisfetchCost) +
		ScaleBlockFrequency(ColderEdgeFreq, JumpInstCost);
		}
		}

		DEBUG(dbgs() << "The cost of loop rotation by making " << getBlockNum(*Iter)
		<< " to the top: " << Cost.getFrequency() << "\n");

		if (Cost < SmallestRotationCost) {
		SmallestRotationCost = Cost;
		RotationPos = Iter;
		}
		}

		if (RotationPos != LoopChain.end()) {
		DEBUG(dbgs() << "Rotate loop by making " << getBlockNum(*RotationPos)
		<< " to the top\n");
		std::rotate(LoopChain.begin(), RotationPos, LoopChain.end());
		}
		}

/// \brief Forms basic block chains from the natural loop structures.		/// \brief Forms basic block chains from the natural loop structures.
///		///
/// These chains are designed to preserve the existing structure of the code		/// These chains are designed to preserve the existing structure of the code
/// as much as possible. We can then stitch the chains together in a way which		/// as much as possible. We can then stitch the chains together in a way which
/// both preserves the topological structure and minimizes taken conditional		/// both preserves the topological structure and minimizes taken conditional
/// branches.		/// branches.
void MachineBlockPlacement::buildLoopChains(MachineFunction &F,		void MachineBlockPlacement::buildLoopChains(MachineFunction &F,
MachineLoop &L) {		MachineLoop &L) {
// First recurse through any nested loops, building chains for those inner		// First recurse through any nested loops, building chains for those inner
// loops.		// loops.
for (MachineLoop *InnerLoop : L)		for (MachineLoop *InnerLoop : L)
buildLoopChains(F, *InnerLoop);		buildLoopChains(F, *InnerLoop);

SmallVector<MachineBasicBlock *, 16> BlockWorkList;		SmallVector<MachineBasicBlock *, 16> BlockWorkList;
BlockFilterSet LoopBlockSet(L.block_begin(), L.block_end());		BlockFilterSet LoopBlockSet(L.block_begin(), L.block_end());

		// Check if we have profile data for this function. If yes, we will rotate
		// this loop by modeling costs more precisely which requires the profile data
		// for better layout.
		bool RotateLoopWithProfile =
		PreciseRotationCost && F.getFunction()->getEntryCount();

// First check to see if there is an obviously preferable top block for the		// First check to see if there is an obviously preferable top block for the
// loop. This will default to the header, but may end up as one of the		// loop. This will default to the header, but may end up as one of the
// predecessors to the header if there is one which will result in strictly		// predecessors to the header if there is one which will result in strictly
// fewer branches in the loop body.		// fewer branches in the loop body.
MachineBasicBlock *LoopTop = findBestLoopTop(L, LoopBlockSet);		// When we use profile data to rotate the loop, this is unnecessary.
		MachineBasicBlock *LoopTop =
		RotateLoopWithProfile ? L.getHeader() : findBestLoopTop(L, LoopBlockSet);

// If we selected just the header for the loop top, look for a potentially		// If we selected just the header for the loop top, look for a potentially
// profitable exit block in the event that rotating the loop can eliminate		// profitable exit block in the event that rotating the loop can eliminate
// branches by placing an exit edge at the bottom.		// branches by placing an exit edge at the bottom.
MachineBasicBlock *ExitingBB = nullptr;		MachineBasicBlock *ExitingBB = nullptr;
if (LoopTop == L.getHeader())		if (!RotateLoopWithProfile && LoopTop == L.getHeader())
ExitingBB = findBestLoopExit(F, L, LoopBlockSet);		ExitingBB = findBestLoopExit(F, L, LoopBlockSet);

BlockChain &LoopChain = *BlockToChain[LoopTop];		BlockChain &LoopChain = *BlockToChain[LoopTop];

// FIXME: This is a really lame way of walking the chains in the loop: we		// FIXME: This is a really lame way of walking the chains in the loop: we
// walk the blocks, and use a set to prevent visiting a particular chain		// walk the blocks, and use a set to prevent visiting a particular chain
// twice.		// twice.
SmallPtrSet<BlockChain *, 4> UpdatedPreds;		SmallPtrSet<BlockChain *, 4> UpdatedPreds;
Show All 14 Lines	for (MachineBasicBlock *ChainBB : Chain) {
}		}
}		}

if (Chain.LoopPredecessors == 0)		if (Chain.LoopPredecessors == 0)
BlockWorkList.push_back(*Chain.begin());		BlockWorkList.push_back(*Chain.begin());
}		}

buildChain(LoopTop, LoopChain, BlockWorkList, &LoopBlockSet);		buildChain(LoopTop, LoopChain, BlockWorkList, &LoopBlockSet);

		if (RotateLoopWithProfile)
		rotateLoopWithProfile(LoopChain, L, LoopBlockSet);
		else
rotateLoop(LoopChain, ExitingBB, LoopBlockSet);		rotateLoop(LoopChain, ExitingBB, LoopBlockSet);

DEBUG({		DEBUG({
// Crash at the end so we get all of the debugging output first.		// Crash at the end so we get all of the debugging output first.
bool BadLoop = false;		bool BadLoop = false;
if (LoopChain.LoopPredecessors) {		if (LoopChain.LoopPredecessors) {
BadLoop = true;		BadLoop = true;
dbgs() << "Loop chain contains a block without its preds placed!\n"		dbgs() << "Loop chain contains a block without its preds placed!\n"
<< " Loop header: " << getBlockName(*L.block_begin()) << "\n"		<< " Loop header: " << getBlockName(*L.block_begin()) << "\n"
▲ Show 20 Lines • Show All 368 Lines • Show Last 20 Lines

llvm/trunk/test/CodeGen/X86/code_placement_loop_rotation.ll

				; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux < %s \| FileCheck %s -check-prefix=CHECK
				; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -precise-rotation-cost < %s \| FileCheck %s -check-prefix=CHECK-PROFILE

				define void @foo() {
				; Test that not all edges in the loop chain are fall through without profile
				; data.
				;
				; CHECK-LABEL: foo:
				; CHECK: callq e
				; CHECK: callq f
				; CHECK: callq g
				; CHECK: callq h

				entry:
				br label %header

				header:
				call void @e()
				%call = call zeroext i1 @a()
				br i1 %call, label %if.then, label %if.else, !prof !2

				if.then:
				call void @f()
				br label %if.end

				if.else:
				call void @g()
				br label %if.end

				if.end:
				call void @h()
				%call2 = call zeroext i1 @a()
				br i1 %call2, label %header, label %end

				end:
				ret void
				}

				define void @bar() !prof !1 {
				; Test that all edges in the loop chain are fall through with profile data.
				;
				; CHECK-PROFILE-LABEL: bar:
				; CHECK-PROFILE: callq g
				; CHECK-PROFILE: callq h
				; CHECK-PROFILE: callq e
				; CHECK-PROFILE: callq f

				entry:
				br label %header

				header:
				call void @e()
				%call = call zeroext i1 @a()
				br i1 %call, label %if.then, label %if.else, !prof !2

				if.then:
				call void @f()
				br label %if.end

				if.else:
				call void @g()
				br label %if.end

				if.end:
				call void @h()
				%call2 = call zeroext i1 @a()
				br i1 %call2, label %header, label %end

				end:
				ret void
				}

				declare zeroext i1 @a()
				declare void @e()
				declare void @f()
				declare void @g()
				declare void @h()

				!1 = !{!"function_entry_count", i64 1}
				!2 = !{!"branch_weights", i32 16, i32 16}

llvm/trunk/test/CodeGen/X86/code_placement_loop_rotation2.ll

				; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux < %s \| FileCheck %s -check-prefix=CHECK
				; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -precise-rotation-cost < %s \| FileCheck %s -check-prefix=CHECK-PROFILE

				define void @foo() {
				; Test a nested loop case when profile data is not available.
				;
				; CHECK-LABEL: foo:
				; CHECK: callq b
				; CHECK: callq c
				; CHECK: callq d
				; CHECK: callq e
				; CHECK: callq f
				; CHECK: callq g
				; CHECK: callq h

				entry:
				br label %header

				header:
				call void @b()
				%call = call zeroext i1 @a()
				br i1 %call, label %if.then, label %if.else, !prof !2

				if.then:
				br label %header2

				header2:
				call void @c()
				%call1 = call zeroext i1 @a()
				br i1 %call1, label %if.then2, label %if.else2, !prof !2

				if.then2:
				call void @d()
				br label %if.end2

				if.else2:
				call void @e()
				br label %if.end2

				if.end2:
				call void @f()
				%call2 = call zeroext i1 @a()
				br i1 %call2, label %header2, label %if.end

				if.else:
				call void @g()
				br label %if.end

				if.end:
				call void @h()
				%call3 = call zeroext i1 @a()
				br i1 %call3, label %header, label %end

				end:
				ret void
				}

				define void @bar() !prof !1 {
				; Test a nested loop case when profile data is available.
				;
				; CHECK-PROFILE-LABEL: bar:
				; CHECK-PROFILE: callq e
				; CHECK-PROFILE: callq f
				; CHECK-PROFILE: callq c
				; CHECK-PROFILE: callq d
				; CHECK-PROFILE: callq h
				; CHECK-PROFILE: callq b
				; CHECK-PROFILE: callq g

				entry:
				br label %header

				header:
				call void @b()
				%call = call zeroext i1 @a()
				br i1 %call, label %if.then, label %if.else, !prof !2

				if.then:
				br label %header2

				header2:
				call void @c()
				%call1 = call zeroext i1 @a()
				br i1 %call1, label %if.then2, label %if.else2, !prof !2

				if.then2:
				call void @d()
				br label %if.end2

				if.else2:
				call void @e()
				br label %if.end2

				if.end2:
				call void @f()
				%call2 = call zeroext i1 @a()
				br i1 %call2, label %header2, label %if.end

				if.else:
				call void @g()
				br label %if.end

				if.end:
				call void @h()
				%call3 = call zeroext i1 @a()
				br i1 %call3, label %header, label %end

				end:
				ret void
				}

				declare zeroext i1 @a()
				declare void @b()
				declare void @c()
				declare void @d()
				declare void @e()
				declare void @f()
				declare void @g()
				declare void @h()

				!1 = !{!"function_entry_count", i64 1}
				!2 = !{!"branch_weights", i32 16, i32 16}