This is an archive of the discontinued LLVM Phabricator instance.

[MBP] Add whole chain to BlockFilterSet instead of individual BB
ClosedPublic

Authored by Carrot on Oct 8 2020, 6:14 PM.

Details

Summary

Currently we add individual BB to BlockFilterSet if its frequency satisfies

LoopFreq / Freq <= LoopToColdBlockRatio

LoopFreq is edge frequency from outside to loop header. LoopToColdBlockRatio is a command line parameter.

It doesn't make sense since we always layout whole chain, not individual BBs.

It may also cause a tricky problem. Sometimes it is possible that the LoopFreq of an inner loop is smaller than LoopFreq of outer loop. So a BB can be in BlockFilterSet of inner loop, but not in BlockFilterSet of outer loop, like .cold in the test case. So it is added to the chain of inner loop. When work on the outer loop, .cold is not added to BlockFilterSet, so the edge to successor .problem is not counted in UnscheduledPredecessors of .problem chain. But other blocks in the inner loop are added BlockFilterSet, so the whole inner loop chain can be layout, and markChainSuccessors is called to decrease UnscheduledPredecessors of following chains. markChainSuccessors calls markBlockSuccessors for every BB, even it is not in BlockFilterSet, like .cold, so .problem chain's UnscheduledPredecessors is decreased, but this edge was not counted on in fillWorkLists, so .problem chain's UnscheduledPredecessors becomes 0 when it still has an unscheduled predecessor .pred! And it causes problems in following various successor BB selection algorithms.

Diff Detail

Event Timeline

Carrot created this revision.Oct 8 2020, 6:14 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 8 2020, 6:14 PM
Carrot requested review of this revision.Oct 8 2020, 6:14 PM
davidxl added inline comments.Oct 13 2020, 11:57 AM
llvm/test/CodeGen/X86/block_set.ll
4

Without this patch, the 'cold' block will be outlined outside the loop nests, right? It also makes the outer loop tighter which seems better. With this change, if the outer loop is hot, there will be a hole there.

Carrot added inline comments.Oct 13 2020, 2:25 PM
llvm/test/CodeGen/X86/block_set.ll
4

The cold block will not be put outside of the loop nests no matter with or without this patch.

The LoopFreq value of inner loop is small enough so cold block is included in the chain for inner loop.
The LoopFreq value of outer loop is large enough so cold block is excluded in the BlockFilterSet of outer loop, but other BBs of inner loop is included. When we layout header2, the whole inner loop chain will be merged into the outer loop chain, including cold block. And the edge from .cold to .problem is used to decrement UnscheduledPredecessors of .problem chain, this is unexpected.

ok, makes sense. Basically since the inner loop is laid out first, the outer loop can not individually split out the blocks already laid out. What is the manifest of the problem? wrong profile data or compiler crash or something else?

For the attached test case, it can cause bad layout, block .problem is placed under .cold. With this patch .problem is optimally placed after .pred.

When combined with my BB duplication work, the problem also caused an infinite loop in duplication.

davidxl accepted this revision.Oct 13 2020, 5:21 PM

lgtm

This revision is now accepted and ready to land.Oct 13 2020, 5:21 PM
dmajor added a subscriber: dmajor.Oct 15 2020, 8:28 PM

Hi, this commit caused an assertion in one of our builds:

/builds/worker/fetches/llvm-project/llvm/lib/CodeGen/LiveDebugValues/VarLocBasedImpl.cpp:1669: bool (anonymous namespace)::VarLocBasedLDV::join(llvm::MachineBasicBlock &, (anonymous namespace)::VarLocBasedLDV::VarLocInMBB &, (anonymous namespace)::VarLocBasedLDV::VarLocInMBB &, const (anonymous namespace)::VarLocBasedLDV::VarLocMap &, SmallPtrSet<const llvm::MachineBasicBlock *, 16> &, SmallPtrSetImpl<const llvm::MachineBasicBlock *> &): Assertion `(NumVisited || MBB.pred_empty()) && "Should have processed at least one predecessor"' failed.

It's an LTO+PGO build of Firefox so reproducing it is very time-consuming. I will try to get more info but it may take time and I wanted to give you an advance notice. In the meantime is there anything you can guess by inspection about what went wrong?

@dmajor, without a test case it is difficult to point out what's wrong since there is no obvious relation between this patch and VarLocBasedImpl.cpp. I will try my best to do some simple analysis/guess:

VarLocBasedImpl.cpp runs a dataflow algorithm on a function to collect debug information.
This patch only impacts the field UnscheduledPredecessors of a BlockChain, it is mostly used as a heuristic in block layout algorithms. This patch doesn't directly modify control flow. So it triggered some unknown bug in either MachineBlockPlacement or VarLocBasedImpl.

Let's look at the assertion at VarLocBasedImpl.cpp:1668, the comment says the MBB is processed in reverse post-order, it expects either one of its predecessor is processed or MBB is an entry block. Assume the algorithm is correctly implemented, then the only situation which can trigger this assert is MBB doesn't have a predecessor and is not entry block. But in function VarLocBasedLDV::ExtendRanges, the MBB passed to join() is extracted from Worklist, which is the result of ReversePostOrderTraversal, so MBB must have a predecessor, otherwise it can't be put into Worklist. So the only explanation is there is something wrong in the algorithm implementation.
Another possibility is either MBB.preds or MBB.succs are not correctly maintained although it is not very likely.

dmajor added a subscriber: jmorse.Oct 20 2020, 2:26 PM

Adding @jmorse in case D83046 is related.

I managed to retrieve a linkrepro tarball from our CI builder, but it's 4GB and pretty cumbersome.

The dump of the affected machine function is: https://paste.mozilla.org/g44ZAPGr. We hit the assertion when processing bb.2. Its only predecessor bb.1 was skipped with ignoring unvisited pred MBB: 1, so in the end NumVisited is still zero.

(Aside: is it normal for bb.2 to come before bb.0 in the dump() output? I'm not familiar with the conventions in this area.)

Adding @jmorse in case D83046 is related.

Technically D83046 is NFC, it just juggles the files where the pass implementation lives, however this is definitely my area of focus,

The dump of the affected machine function is: https://paste.mozilla.org/g44ZAPGr.

Thanks for the dump; if I'm interpreting it correctly, the entry block itself has a predecessor:

bb.2 (%ir-block.16):
; predecessors: %bb.1

(Working on the convention that the first block in the function is the entry block). The dataflow implementation doesn't believe this can ever happen, and the assertion hit assumes that the entry block can be identified by its lack of predecessors. At a shallow level, that could be tightened up to identify the entry block by whether it's the first block in the function. IMO it'd be an acceptable immediate fix.

If an entry block with a predecessor is legal behaviour, we'd need to change other parts of the algorithm too as the formal dataflow definitions require a block with no predecessors to begin exploration at. There's a risk that more variable locations get dropped than necessary.

If an entry block with a predecessor is legal behaviour [...]

NB: the LLVM-IR langref section for functions states:

The first basic block in a function is special in two ways: it is immediately executed on entrance to the function, and it is not allowed to have predecessor basic blocks (i.e. there can not be any branches to the entry block of a function). Because the block can have no predecessors, it also cannot have any PHI nodes.

I'd lean towards the input Machine-IR to LiveDebugValues being incorrect here.

As jmorse has pointed out, the function layout is incorrect at the assertion time.

I double checked the code in MachineBlockPlacement::buildCFGChains(), the following two lines should never change the first block (assume it is entry) in a function.

BlockChain &FunctionChain = *BlockToChain[&F->front()];
buildChain(&F->front(), FunctionChain);

So there is no obvious problem in MBP, without live debugging a test case, it's difficult to guess the root problem.

Could you help to check when block bb.0 is moved below bb.2?

Could you help to check when block bb.0 is moved below bb.2?

It looks like it happened after MBP: https://paste.mozilla.org/oAoRCCOV

Chrome is also seeing the same issue: https://crbug.com/1140168

Then it should be an unknown bug in MBP.

You can debug the function MachineBlockPlacement::buildCFGChains() from following 2 lines, and try to find out when the first BB in FunctionChain is changed.

BlockChain &FunctionChain = *BlockToChain[&F->front()];
buildChain(&F->front(), FunctionChain);

Sorry I can't help you to debug it.

@aeubanks, can you extract a test case from your failure?
It will be very helpful for me to debug the problem.

rnk added a subscriber: rnk.Oct 22 2020, 11:04 AM

Has this been reverted yet? If we are confident that this change is causing assertion failures and it looks like a fix forward will be difficult, it's preferable to revert rather than fix forward.

It has been reverted.

There is a (very large) repro in https://crbug.com/1140168, if more is necessary let's follow up on that bug.

In case it helps, I turned on debug logging for this pass on our repro: https://paste.mozilla.org/HBDkeGXO (starts at line 60)

In case it helps, I turned on debug logging for this pass on our repro: https://paste.mozilla.org/HBDkeGXO (starts at line 60)

The debug logging is actually very helpful! From the log we can see that bb.0 is moved after bb.2 when rotating the loop chain. So the problem is in function rotateLoop, we should check if the chain header is entry block before rotating. We should also do similar check in function rotateLoopWithProfile.

@dmajor and @aeubanks, could you help to see if the following patch can fix the crash when combined with patch adfb5415010f?

Thanks!

diff --git a/llvm/lib/CodeGen/MachineBlockPlacement.cpp b/llvm/lib/CodeGen/MachineBlockPlacement.cpp
index 8a8669638031..69f5b9d29c06 100644
--- a/llvm/lib/CodeGen/MachineBlockPlacement.cpp
+++ b/llvm/lib/CodeGen/MachineBlockPlacement.cpp
@@ -2312,6 +2312,10 @@ void MachineBlockPlacement::rotateLoop(BlockChain &LoopChain,
   if (Bottom == ExitingBB)
     return;
 
+  // The entry block should always be the first BB in a function.
+  if (Top == &F->front())
+    return;
+
   bool ViableTopFallthrough = hasViableTopFallthrough(Top, LoopBlockSet);
 
   // If the header has viable fallthrough, check whether the current loop
@@ -2386,6 +2390,11 @@ void MachineBlockPlacement::rotateLoopWithProfile(
     BlockChain &LoopChain, const MachineLoop &L,
     const BlockFilterSet &LoopBlockSet) {
   auto RotationPos = LoopChain.end();
+  MachineBasicBlock *ChainHeaderBB = *LoopChain.begin();
+
+  // The entry block should always be the first BB in a function.
+  if (ChainHeaderBB == &F->front())
+    return;
 
   BlockFrequency SmallestRotationCost = BlockFrequency::getMaxFrequency();
 
@@ -2404,7 +2413,6 @@ void MachineBlockPlacement::rotateLoopWithProfile(
   // chain head is not the loop header. As we only consider natural loops with
   // single header, this computation can be done only once.
   BlockFrequency HeaderFallThroughCost(0);
-  MachineBasicBlock *ChainHeaderBB = *LoopChain.begin();
   for (auto *Pred : ChainHeaderBB->predecessors()) {
     BlockChain *PredChain = BlockToChain[Pred];
     if (!LoopBlockSet.count(Pred) &&

@dmajor and @aeubanks, could you help to see if the following patch can fix the crash when combined with patch adfb5415010f?

From my side the build was successful!

The real root cause of crash in building chrome and firefox has been committed https://reviews.llvm.org/D92882.

I will reapply this patch tomorrow if there is no objection.

Confirmed our bots are still OK after the re-landing. Thanks!