This is an archive of the discontinued LLVM Phabricator instance.

Improvement on computing edge probabilities when choosing the best successor in machine block placement.
ClosedPublic

Authored by congh on Jun 29 2015, 4:53 PM.

Details

Summary

When looking for the best successor from the outer loop for a block belonging to an inner loop, the edge probability computation can be improved so that edges in the inner loop are ignored. For example, suppose we are building chains for the non-loop part of the following code, and looking for B1's best successor. Assume the true body is very hot, then B3 should be the best candidate. However, because of the existence of the back edge from B1 to B0, the probability from B1 to B3 can be very small, preventing B3 to be its successor. In this patch, when computing the probability of the edge from B1 to B3, the weight on the back edge B1->B0 is ignored, so that B1->B3 will have 100% probability.

if (...)

do {
  B0;
  ... // some branches
  B1;
} while(...);

else

B2;

B3;

Diff Detail

Repository
rL LLVM

Event Timeline

congh updated this revision to Diff 28732.Jun 29 2015, 4:53 PM
congh retitled this revision from to Improvement on computing edge probabilities when choosing the best successor in machine block placement. .
congh updated this object.
congh edited the test plan for this revision. (Show Details)
congh added a reviewer: chandlerc.
congh added a subscriber: Unknown Object (MLST).
davidxl edited edge metadata.Jul 7 2015, 11:06 AM

Looks reasonable -- as the already laid-out inner loop should be treated as a 'collapsed' node.

lib/CodeGen/MachineBlockPlacement.cpp
392 ↗(On Diff #28732)

I think it is better to provide a wrapper method to compute getSumForBlock (BB). In the wrapper method, exclude edge whose destination is in the same loop chain as BB.

congh updated this revision to Diff 29494.Jul 10 2015, 2:09 PM
congh edited edge metadata.

Update the patch to add context lines.

congh updated this revision to Diff 30426.Jul 22 2015, 4:46 PM
congh added a reviewer: dexonsmith.

Fixed a bug in the patch.

davidxl added inline comments.Jul 23 2015, 12:30 AM
lib/CodeGen/MachineBlockPlacement.cpp
367 ↗(On Diff #30426)

Do SumWeight adjustment here -- outside the main loop:

uint32_t SumWeight = ...
SumWeight = AdjustWeightSumExcludingInternalBackEdge(SumWeight, BB);

386 ↗(On Diff #30426)

Do you have examples of the problem when chain formation is for a 'peer' loop?

393 ↗(On Diff #30426)

need to check SuccLoop !=0 ?

393 ↗(On Diff #30426)

It is more readable to move the Sumweight adjustment loop outside the main for loop and probably in a helper function.

397 ↗(On Diff #30426)

No need to change SuccProb to SuccProbInLoop.

402 ↗(On Diff #30426)

No need for var name change.

434 ↗(On Diff #30426)

Change SuccProb to RealSuccProb

congh marked 3 inline comments as done.Jul 23 2015, 11:03 AM
congh added inline comments.
lib/CodeGen/MachineBlockPlacement.cpp
367 ↗(On Diff #30426)

The adjustment on SumWeight is done only when it has a successor outside of its loop that is also in BlockFilter (if the successor outside of the loop is not in BlockFilter, we are building chains for the loop). It makes more sense to calculate this adjusted weight inside of this loop. Note that when we build chains for the loop, we don't need this adjusted weight at all.

386 ↗(On Diff #30426)

The following partial CFG shows such a case:

A
⇵
B
↓
C
⇵
D

Here, B->C is an edge between two peer loops.

393 ↗(On Diff #30426)

SuccLoop can be 0 when the successor is not in any loop.

davidxl added inline comments.Jul 23 2015, 12:03 PM
lib/CodeGen/MachineBlockPlacement.cpp
367 ↗(On Diff #30426)

I think the adjustment can be done without even needing to have a top level for loop like

for (auto Succ: BB->successors()) {

if (...) continue;
for (auto SuccBB : BB->successors())

​ if (MLI->getLoopFor(SuccBB) == L)

		​          SumWeight2 -= MBPI->getEdgeWeight(BB, SuccBB) / WeightScale;
    ..

}

but simply:

for (auto SuccBB : BB->successors())

        // We find out that SuccBB is already in 'Chain', but it is not the end of 'Chain' -- it must be a block in the inner 
        // loop including BB that has already be laid out. Skip it 
        if (BlockToChain(SuccBB) == &Chain)   
		​   SumWeight -= MBPI->getEdgeWeight(BB, SuccBB) / WeightScale;
386 ↗(On Diff #30426)

Ok. A test case would be good.

congh added inline comments.Jul 23 2015, 1:45 PM
lib/CodeGen/MachineBlockPlacement.cpp
367 ↗(On Diff #30426)

But if the first successor in BB->successors() is not in the same loop as BB, we won't get the correct SumWeight in this way.

As most CFG nodes have at most two outgoing edges, we won't get many repeated computations.

386 ↗(On Diff #30426)

I think my comment here is incorrect. In the case above, we are not building chains for the peer loop but the outer loop, while the successor belongs to a peer loop. I will correct the comment.

386 ↗(On Diff #30426)

OK. Will add such a case.

davidxl added inline comments.Jul 23 2015, 2:25 PM
lib/CodeGen/MachineBlockPlacement.cpp
367 ↗(On Diff #30426)

If the successor of BB is not in the same loop as BB but matches the condition I listed, there are two cases:

  1. the succBB is in an inner loop of BB's loop, in this case we want to skip this SuccBB
  2. the succBB is in the current loop being laid out. If succBB satisfies the condition, this block must be a hotter block that is laid out before 'BB' with top-order broken. In this case, edge weight (BB->SuccBB) should also be skipped -- this is not any different from the backedge case.
congh added inline comments.Jul 23 2015, 2:47 PM
lib/CodeGen/MachineBlockPlacement.cpp
367 ↗(On Diff #30426)

Now I understand your suggested simple loop: you are putting that simple loop before the outer for loop instead of inserting the if statement inside of it.

Considering the case 2 above, I think now it makes more sense to hoist this sum weight update outside of the loop.

I will update the patch accordingly. Thanks for the advice!

congh updated this revision to Diff 30533.Jul 23 2015, 3:52 PM

Update the patch.

When looking for the best successor of a block, now we only consider its valid successors that could become its successor in the chain, and calculate the sum of weights from those valid successors which is later used to calculate edge probabilities. This can handle more general cases than the original patch.

davidxl added inline comments.Jul 23 2015, 4:26 PM
lib/CodeGen/MachineBlockPlacement.cpp
373 ↗(On Diff #30533)

This is the case where Succ is in the outer loop of the current loop being laid out. In such a case, Succ can still be a viable layout successor of BB if BB ends up being the end of the current chain. This is tricky to handle as loop rotation may also change that -- so it seems to good to add this skipping condition, but I'd like to see test cases that can validate it.

381 ↗(On Diff #30533)

This is a good one to add.

congh added inline comments.Jul 23 2015, 5:08 PM
lib/CodeGen/MachineBlockPlacement.cpp
373 ↗(On Diff #30533)

Given the following example:

--->A
|  / \
| B   C
|  \ / \
----D   E

Assume the weights of A->B and A->C are 1 and 99, and the weights of C->D and C->E are 50 and 50. Without this patch, after A->C is chosen, C->D could not be chosen due to CFG constraint and the probability of it is only 50%. So the generated layout for the loop is A C B D. With this patch, the layout will be A C D B, which is better than before. (Note that with loop rotation the latter may suffer some issue but it isn't the problem of this patch; I will deal with it alter).

I will add such a test case.

congh updated this revision to Diff 30549.Jul 23 2015, 5:27 PM

Add one more test case showing that we ignore the edge outside of the loop during weights summation when we build the chain for the loop.

chandlerc added inline comments.Jul 24 2015, 7:59 PM
lib/CodeGen/MachineBlockPlacement.cpp
365–371 ↗(On Diff #30549)

This is made incredibly annoying because we don't rescale the branch probabilities in MBPI the way we do in BPI, and so we don't know for a fact that they fit in 32-bits.

I think we need to fix MBPI to automatically scale all the weights so that the sum of weights is known to fit in 32-bits. Once that's done, WeightScale and all the code in getSumForBlock to rescale things goes away.

Then, here, you should change this to just compute the sum directly rather than getting the sum from MBPI and then adjusting it after-the-fact.

Also, I would just embed the predicate into a lambda, and write two loops over BB->successors() rather than building up a vector of successors.

381–383 ↗(On Diff #30549)

This is a bigger change than the rest, and I think it should go in separately.

The rest of this change is a clear bug-fix for how this code works, and will make the "hot" detection actually reasonable below. This change, however, is shifting a core thing to prefer to outline all optional loop bodies which don't have fallthrough into them. That, to me, is a very significant change, and I'd rather look at it (and test cases and benchmarks showing it is clearly the right call) independently of fixing how we compute the probability of 1 of N successors.

davidxl added inline comments.Jul 24 2015, 9:19 PM
lib/CodeGen/MachineBlockPlacement.cpp
381–383 ↗(On Diff #30549)

The core idea of Cong's patch is to make trace formation more correctly. The basic principle of the original trace formation (probability based) is that when a BB has multiple viable successors, but none of them is more dominating, then do not use outgoing succ edge frequency to determine layout successor

The bug in the current implementation is that when checking dominating outgoing edges, it also look at non-viable successors that have already being laid out.

The fix here is simply to eliminate the non-candidate when selecting trace successor.

Cong, as I mentioned before, it is much clearer to extract the adjustment code into a helper routine with documentation to make the intention clear.

congh added inline comments.Jul 27 2015, 12:36 AM
lib/CodeGen/MachineBlockPlacement.cpp
365–371 ↗(On Diff #30549)

OK. Let me try to produce a patch doing edge weight normalization for MBPI.

381–383 ↗(On Diff #30549)

I have some cases that this change is beneficial. But I agree we need to reevaluate its impact on loops without fall-through to header (from the aspect of icache locality). I can disable this "optimization" attempt from this patch and do more research on it.

congh added inline comments.Jul 28 2015, 3:55 PM
lib/CodeGen/MachineBlockPlacement.cpp
365–371 ↗(On Diff #30549)

I found it not straightforward to normalize weights in MBPI. Note that MBPI reads weights from MBB, which provides the interface addSuccessor(MachineBasicBlock *succ, uint32_t weight) to let users indicate arbitrary weights for new added edges. Because we don't know how users use this interface, it can be difficult to adjust weights in MBB. I think there may be two solutions:

  1. We provide the interface normalizeWeights in MBB. Once we call this interface, the sum of weights is guaranteed to be less than UINT32_MAX. However, after the user add new successors with weights, we need to re-normalize it. This can still be useful when we want to calculate branch probabilities or do other adjustment after the weights are normalized.
  1. We maintained both normalized and original weights in MBB, so that once the original weights is modified, we also adjust the normalized weights. This solutions requires more space and more frequent calculations comparing to the first one, but the normalized weights are always guaranteed.

Do you have any other ideas? Or which one do you think is better?

Will get back to it .

Looks fine to me with some nits. It might be better to hold off this until the BP interface change is done?

lib/CodeGen/MachineBlockPlacement.cpp
373 ↗(On Diff #30549)

Can you add your example as a comment in the code?

congh updated this revision to Diff 40450.Nov 17 2015, 4:06 PM

Update the patch according to David's comment.

congh marked an inline comment as done.Nov 17 2015, 4:08 PM

Looks fine to me with some nits. It might be better to hold off this until the BP interface change is done?

That patch will update the code in this file and I think it is fine to check in it now.

This revision was automatically updated to reflect the committed changes.