This is a reimplementation of the orderNodes function, as the old implementation didn't take into account all cases.
Fix PR41509
Differential D79037
[StructurizeCFG] Fix region nodes ordering ekatz on Apr 28 2020, 1:45 PM. Authored by
Details This is a reimplementation of the orderNodes function, as the old implementation didn't take into account all cases. Fix PR41509
Diff Detail
Event TimelineComment Actions This needs more tests. There are a bunch of decisions being taken during the reordering, but the test peresented simply exercises the case of one nested loop inside another. For example, what about two inner loops L1 and L2, where L1 has an exit that reaches the entry of L2? There might be other interesting cases too. We actually introduced a different pass recently, called "UnifyLoopExits" to circumvent the same problem in the structurizer. You can see it optionally invoked in the AMDGPU backend. The idea is to transform each loop to have a single exit block, which makes it look like a separate region; since the structurizer is a region pass, it no longer sees a loop nest and thus cannot get confused about the nesting. The current patch can provide a different, more precise way to solve the same problem. But please do take a look at the tests that use UnifyLoopExits. They are under test/Transforms/UnifyLoopExits and test/Transforms/StructurizeCFG/workarounds.
Comment Actions It seems the new commit is an incremental update on the previous commit. Can you please submit a single squashed commit, so that the change can be compared against the original state of the trunk? Also, the following tests were created specifically to demonstrate the wrong ordering:
If the proposed change is correct, then the example in these tests will be structurized correctly even after removing the "-unify-loop-exits" argument. These tests should also be checked and then updated accordingly. In fact, it would be nice to update the following tests. But they have a very confusing history and they are specific to AMDGPU. I can volunteer to do this later as a follow up.
Comment Actions I am not sure what do you mean. In this page in the Phabricator, under [Revision Contents]->[History] you can choose which of the patch revisions to diff. The default is "Base" against the latest "Diff no.".
Great! Didn't know about those. I'll update them, and if everything is OK, I'll move them out of the "workarounds" directory.
I'll try to take a look at those as well. These are "XFAIL" right now. If I'm understanding correctly, You assume they should pass with this patch? Comment Actions Sorry about that! I can see the diff now ...
Note that needs-fr-ule.ll is quite a painful CFG from a real world testcase. It needs FixIrreducible pass, which produces extra loops that have the same ordering problem being fixed here. I'll try to run your patch myself and point out which blocks show up in the wrong order.
That would be great! But like I said, the file history is confusing, and the last time that I looked, I was unable to decide which blocks are in the wrong order.
Comment Actions The implementation looks okay to me. The new comments help a lot. Waiting for tests mentioned in the comments so far ... one new test (sibling loops where the DFS jumps from one to the other before finishing the first loop) and updates to existing tests. Comment Actions I have bad news. This fix does not help with the test needs-fr-ule.ll, i.e., the resulting CFG is incorret if I remove the unify-loop-exits pass. The fix-irreducible pass produces an inner loop with the header "irr.guard" and a latch "while.cond47". This is inside an outer loop with header "while.cond" and many backedges. When I run the structurizer with the current fix on this graph, the backedge of the inner loop is lost. Instead we see a spurious loop between two new blocks Flow5 and Flow4, which is governed by incorrect boolean constants when in fact they should have been %Pred variables. That information is lost along the edge Flow8 to Flow5. Now I am kicking myself for not remembering this earlier. I had encountered the same problem, which is why we introduced the unify-loop-exits pass. Things are not so easy with nested loops. It's not enough to process all inner blocks before outer blocks. The ordering of the outer blocks relative to the inner loop also matters, and there does not seem to be a practical way to preserve it. Comment Actions The problem with those tests is something else, and not produced by the wrong ordering. You may refer to PR25378 for an example. Comment Actions I see. It sure does look like PR25378. Then the only thing left is one new test with two sibling loops. The intention is to cover the case where CurrentLoop is not the parent of L. Comment Actions I don't think it is possible, although I might be wrong. I hope I explained it clear enough. Otherwise, can you please give an example? Comment Actions I agree. That means that if L exists, then CurrentLoop must be its parent, right? So instead of checking for this condition, the code will have to assert it. Comment Actions Actually, I take that back. Consider the following CFG: A -> H1, H2
Comment Actions In this case, those loops are in different regions, and as this is a region pass, they won't appear in the same POT. Comment Actions You need to look at the larger pattern than just the specific example. It should be possible to add exits in both loops that force them into the same region. For example, H2->F and B->F. Comment Actions Added test for interleaved sibling loops, and change the test file name to "interleaved-loop-order.ll". Comment Actions LGTM. It would be nice if you can move the following tests to the new file and demonstrate that that they are fixed. But that need not hold up this review any longer.
[EDIT] Actually added the test names. Comment Actions The CFG produced by "backedge-id-bug.ll" seems to be OK, just a little different than expected due to a different ordering scheme used when it was produced. This needs to be double checked, but as I see it, the expected just needs to be updated. In any case, those files produce the same CFG with and without this patch. Comment Actions Hi, This commit makes some compilations go into an infinite loop. I will try to prepare a reproducer. Comment Actions TBH, I spent a small amount of time visualizing the new testcase, but couldn't figure out the problem, nor the solution. I see that the check for currentLoopSize is moved up the control flow a bit, but I didn't see why. A few LLVM_DEBUG lines would be very helpful to trace what the structurizer is doing. Also, can the new test case be reduced further? There are a couple of simple chains, and a few branches. Are all of them necessary? The main problem seems to be an inner loop with just two blocks where both blocks exit. I don't know whether both these properties matter. For example, will the problem happen with a single block that loops on itself? Comment Actions We need to move up the check for *CurrentLoopSize, as in the case it is zero, then we no longer need to skip nodes, and just continue to the next node/loop. Comment Actions Actually, I spent some more time looking at this today. In the simplified testcase, the region being processed is { D, E, C } listed in post-order. Each one of these is a child of some loop, although the loop for D is not in the current region. As a result, when the outermost loop starts, CurrentLoop is nullptr but CurrentLoopSize is zero because there is no node in the null loop. Checking for CurrentLoopSize is not a very satisfying explanation for the fix, since the loop for D actually has an entry in LoopSizes. In other words, the pre-condition is that the sum of LoopSizes is the number of nodes in the region. Then the post-condition should be that the same sum should be zero. An assertion for this would fail in this testcase. Comment Actions This is a second take of the solution, using SCCs instead of LoopInfo. Comment Actions Thanks! This looks really nice now! The sad part is that the structurizer itself is undefined in the presence of irreducible regions, so we will never actually see this in action around irreducible regions.
Comment Actions Update algorithm to use a custom GraphTraits and a NodeRef that holds the set of nodes for the subgraph. Comment Actions This looks really good. Thanks a lot for bringing it this far! Please do check the #include for LoopInfo.h before submitting.
|
I think this can be removed too. At least a quick grep did not show any use of Loop or LoopInfo other than the code that gets eleminated with this patch.