This is an archive of the discontinued LLVM Phabricator instance.

[mlir][cf] Add ControlFlow to SCF lifting pass
ClosedPublic

Authored by zero9178 on Aug 2 2023, 6:28 AM.

Details

Summary

Structured control flow ops have proven very useful for many transformations doing analysis on conditional flow and loops. Doing these transformations on CFGs requires repeated analysis of the IR possibly leading to more complicated or less capable implementations. With structured control flow, a lot of the information is already present in the structure.

This patch therefore adds a transformation making it possible to lift arbitrary control flow graphs to structured control flow operations. The algorithm used is outlined in https://dl.acm.org/doi/10.1145/2693261. The complexity in implementing the algorithm was mostly spent correctly handling block arguments in MLIR (the paper only addresses the control flow graph part of it).

Note that the transformation has been implemented fully generically and does not depend on any dialect. An interface implemented by the caller is used to construct any operation necessary for the transformation, making it possible to create an interface implementation purpose fit for ones IR.

For the purpose of testing and due to likely being a very common scenario, this patch adds an interface implementation lifting the control flow dialect to the SCF dialect.
Note the use of the word "lifting". Unlike other conversion passes, this pass is not 100% guaranteed to convert all ControlFlow ops.
Only if the input region being transformed contains a single kind of return-like operations is it guaranteed to replace all control flow ops. If that is not the case, exactly one control flow op will remain branching to regions terminating with a given return-like operation (e.g. one region terminates with llvm.return the other with llvm.unreachable).


Additional notes:
Special thanks to the numba-mlir developers for the inspiration and some valuable testcases.
Their implementation of the same algorithm can be found here https://github.com/numba/numba-mlir/blob/9a3356a7d0ab99c69591fd5d26a98a9c5b93133a/mlir/lib/Conversion/CfgToScf.cpp
No code has been taken from their implementation, however.

I am also unsure who else is interested in such a pass, feel free to add more reviewers if anyone knows.

Diff Detail

Event Timeline

zero9178 created this revision.Aug 2 2023, 6:28 AM
Herald added a project: Restricted Project. · View Herald Transcript
zero9178 requested review of this revision.Aug 2 2023, 6:28 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptAug 2 2023, 6:28 AM

Tour de force! This is super cool. A welcome addition to the upstream library of passes.

mlir/lib/Conversion/ControlFlowToSCF/ControlFlowToSCF.cpp
75

It seems like a better API would be to make all these functions FailureOr<Operation *> and then propagate the failure up the CFGToSCF pass itself. Not all structured CF dialects will be able to support everything the pass requires.

Either having the pass fail entirely or leave unraised CFGs in the IR seems better to me (I don't know how hard the latter option would be).

mlir/lib/Transforms/Utils/CFGToSCF.cpp
143

I'm personally not a fan of the addArguments API. It's cheaper to just loop over the arguments of other and call addArgument one-at-a-time because it saves the constructor of the vector

186

This should be a static method at the top-level

213

It would be great to get some documentation on the fields

246

function_ref is mapped into the MLIR namespace

464

Would it be cheaper to just keep a flag/counter around?

475
479

I'm not sure how big the SCCs get in practice, but I think DFS over the SCC would have better complexity (O(E+V) I think) than looping over every in/out edge of every block

532
650

The lambda here seems unnecessary given that it only has one return

717
1076

A minor optimization here would be to only check the terminator operations of each block in the region.

mlir/test/Conversion/ControlFlowToSCF/test.mlir
531

Yeah I was afraid of something like this when I skimmed through the algorithm. "Perfect reconstructibility" comes at a cost, but I still think this pass will be very useful when the input is "mostly nice".

Mogball added a comment.EditedAug 2 2023, 11:58 AM

Have you tried importing random C programs through LLVMIR and see how this pass performs? It would be very interesting to see how it scales for large programs (I struggle to guess the complexity just by reading the code).

Thanks for doing this! After this is merged I can try to run our e2e Python tests using pass.

We also have some patterns for post cfg-to-scf loop cleanup, but they can be upstreamed separately.

After this is merged I can try to run our e2e Python tests using pass.

That would be awesome. Please share your results! I haven't seen the numba-mlir repo until today and I must say I'm a fan

gysit added a comment.Aug 3 2023, 12:53 AM

Amazing progress!

I did a first pass and the logic makes sense to me. I left a couple of nit and naming comments. The latter can be seen as suggestions, take what you think makes sense.

mlir/include/mlir/Transforms/CFGToSCF.h
38

Would it make sense to prefix all methods that create SCF operation with SCF? E.g., for this one it may make sense to rename to createSCFBranchOp and below createSCFLoopOp. That way it may be a bit easier to follow in the code when new SCF stuff is created.

mlir/lib/Conversion/ControlFlowToSCF/ControlFlowToSCF.cpp
137–139
172–175

ultra nit:

mlir/lib/Transforms/Utils/CFGToSCF.cpp
92

or some other form of emphasizing the first would be nice since it is pretty important.

130–131
191

I would probably enumerate: Structure that contains the entry, exit, and back edges of a cycle.

194
205
211–213

nit: I would probably put the private member variables at the end of the class definition? Also maybe put an explicit private: at the beginning of the class definition?

235–238

nit: extraArgs contains the types of possible extra block arguments passed to the multiplexer block that are added to the successor operands of every outgoing edge.

258

nit: redirectEdges == entries? Would it make sense to rename to edgeToArgRangeMap, or similar?

305

Could we fill the array using an llvm::enumerate and if blocks that check based on the index what we need todo?

Something similar to this:

for (auto &it : llvm::enumerate(multiplexerBlock->getArguments()) {
  if (block.offset <= it.index() < block.offset + block.size) {
    // copy args of this block
   continue; 
  }
  if (it.index() == discriminatorIdx) {
    // copy discriminator 
    continue;
  }
  if (it.index() > discriminatorIdx) {
    // copy extra argument
    continue;
  }
  // create an undef value
}
316

nit: discriminatorAndExtraArgsSize? The term remaining seems a bit confusing here since I would interpret this as the arguments of the remaining blocks.

349

ultra nit: drop newline?

363

isn't caseArgument.size() > 1 iff discriminator != 0? if yes I would probably simplify to:

if (caseArguments.size() == 1)
  discriminator = getSwitchValue(0);
383

ultra nit: ReturnLikeOpEquivalence or ReturnLikeEquivalence?

407

nit: maybe RegionExitCombiner or similar? Transformer sounds a bit generic?

408–412
408–414

Let's move the private members to the end of the class definition? The comment could be transformed to a doc string?

424

nit: maybe rename to combineWithPrevExits or similar?

Also, would it make sense to shorten operation to op for some of the variables, e.g. returnLikeOperation -> returnLikeOp or originalOperation -> originalOp?

462

nit: hasChanged

471

nit: -> the entry, exit, and back edges of the given cycle.

497–503
550

nit: should we replace arc by edge here and below?

573–574
752
873

nit: I would favor if blocks over switch case in this case:

if (regionEntry->getNumSuccessors() == 0) {
  //
}
if (regionEntry->getNumSuccessors() == 1) {
  //
}
900
959

?

Especially the case three below is quite hard to follow and I am not sure the first sentence matches may understanding of the code. I wonder if examples would be more helpful here than text only?

961

Reading the code it seems the condition is somewhat more complex:

  • Either only a subset has a continuation edge
  • Or not all continuation edges point to the same continuation block.

is that correct?

I guess an example may be more intuitive to understand in this case.

1172

A comment that explains what the transform functions add to the work lists may make sense here albeit it being a bit repetitive given the function doc strings.

Mogball added inline comments.Aug 3 2023, 9:56 AM
mlir/include/mlir/Transforms/CFGToSCF.h
38

This is a generic interface and isn't specific to the SCF dialect (although the name of the interface is confusing SCF as in structured control flow with SCF the dialect...)

gysit added inline comments.Aug 3 2023, 10:16 AM
mlir/include/mlir/Transforms/CFGToSCF.h
38

I agree this is a bit confusing - in most places in this revision SCF actually refers to structured control flow an not to the dialect. Any suggestions for an alternative naming that does not collide with the dialect name is very welcome. Maybe StructuredCF?

zero9178 marked an inline comment as done.Aug 3 2023, 12:01 PM
zero9178 added inline comments.
mlir/lib/Transforms/Utils/CFGToSCF.cpp
479

Could you elaborate on your idea? I am not sure I follow as the edges we are interested in are both incoming edges from outside blocks and outgoing edges to outside blocks therefore requiring that we iterate over all of them. I am not sure how you could reduce the complexity here. A DFS over the SCC would simply yield me the SCC + all its transitive successors which would still be missing the entry edges into the SCC. Complexity wise, both the DFS and the successor check here also done one hash set lookup per outgoing edge.

mlir/test/Conversion/ControlFlowToSCF/test.mlir
531

There are even papers saying that irreducible control flow is incredibly rare :) I felt it was somewhat trivial to make irreducible loops work as shown in the paper however.
There is also room for improvement in the future with smarter edge multiplexer as we currently are essentially adding all block arguments of entry blocks up despite all but the block args of the block being branched to being undef. I.e. all the block arguments stemming from entry blocks in the multiplexer block are used mutually exclusively.
By creating a mapping of what block arguments of what type already exist on the multiplexer block you could then reduce the amount of block arguments. That combined with canonicalizations on the SCF ops and arith.select would make this a bit nicer, although the general structure with the switch/if inside the loop is likely to remain.

This is really exciting! Here is a grammar pass (use more punctuation please!).

mlir/include/mlir/Transforms/CFGToSCF.h
92
mlir/lib/Conversion/ControlFlowToSCF/ControlFlowToSCF.cpp
50–51
mlir/lib/Transforms/Utils/CFGToSCF.cpp
29
32–34
49

As much as I like ASCII art, calling it the "best" form of visualization might be an overstatement 😉

62

You're not continuing any sentence here 🙂

78–79
86–90
205–209

One sentence about what it does, one sentence about why (I would potentially even reverse the order)

421–423
508–509
530
636–639

Alternatively "similar to LCSSA form", but definitely not equal

822–827
zero9178 updated this revision to Diff 547730.Aug 7 2023, 5:13 AM
zero9178 marked 59 inline comments as done.
  • Address review comments
  • Fix some dominance related bugs and add regression tests
  • Fix some issues related to return-like ops and add regression tests

Have you tried importing random C programs through LLVMIR and see how this pass performs? It would be very interesting to see how it scales for large programs (I struggle to guess the complexity just by reading the code).

Purely implementation-wise, the pass is quadratic over the region nesting. There are two reasons:

  • The algorithm is top-down, starting with outer-most loops and conditional regions, moving inwards. It therefore needs to visit and move n - k blocks for each subregion created (where n is the block count of the parent region).
  • For each new region, the dominance tree is currently recalculated. This is a matter of implementation quality and can be improved by manually updating the dominance tree with the transformations performed in the future if necessary.

I have also run a quick, not very scientific, test of two LLVM IR input files, imported them into MLIR, lifted them to CF and run the pass over them. One I had lying around, the other is actually the CFGToSCF.cpp file compiled with clang (very meta :) ).
The smaller file had 1544 basic blocks, not counting entry blocks, measured with grep -P "^\s*\^\w+" | wc -l
The larger file had 3613 basic blocks, not counting entry blocks.
Operation count is useless metric as the algorithm is only sensitive to basic block count.

The smaller file took on average 10.71ms measured with 41 runs of ./mlir-opt --lift-cf-to-scf ~/CFGToSCF.mlir --mlir-timing > /dev/null 2> >(grep Lift >&2)
The larger file took on average 14.47ms measured the same way.
So not a very large increase but at the same time, just these two runs is not really suffiecent to talk about the complexity of the algorithm in practice.
I expect the region nesting to be comparatively very small in most programs.

mlir/include/mlir/Transforms/CFGToSCF.h
38

I've gone ahead and at least put Structured in the name of these to make them distinct from the CFG ones

mlir/lib/Conversion/ControlFlowToSCF/ControlFlowToSCF.cpp
75

Leaving unraised IR is somewhat problematic unless in the function op. I implemented it as signaling pass failure now. I see it as essentially a precondition violation of the interface.

mlir/lib/Transforms/Utils/CFGToSCF.cpp
258

I actually meant to refer to the redirectEdge method here.
entries I have renamed to blockArgMapping since that is essentially what it is: A mapping from the block arguments of an entry block to the block arguments in the multiplexer block.

363

This is not guaranteed due to the excluded set. It can still cause the discriminator to be redundant if it leads to only having one successor. I nevertheless switched to an easier to read if.

961

I have added some ASCII art. Enjoy :-)

gysit added a comment.Aug 8 2023, 2:34 AM

Very nice improvement, we are getting there!

I have some additional nit comments and there is one suspicious test that is probably worth having a second look at.

mlir/include/mlir/Transforms/CFGToSCF.h
84

ultra nit: createCFGSwitchOp for consistency?

mlir/lib/Transforms/Utils/CFGToSCF.cpp
198
594
726

nit: drop braces.

828

ultra nit: Would it make sense to name it newLoopParentBlock?

846

ultra nit: I would rename to loopOp or even structuredLoopOp to avoid the scf.

966

ultra nit.

1005

ultra nit: Any chance there is a way to illustrate a bit better that the return is part of region1? Otherwise the graphics are super nice!

1116

nit: subRegionEntryBlock or subRegionEntry to not confuse it with the regionEntry?

mlir/test/Conversion/ControlFlowToSCF/test.mlir
292

I would have expected that we return %[[ARG1]] instead of %[[POISON]] here since this should be the result of the control flow loop if I understand correctly?

zero9178 updated this revision to Diff 548151.Aug 8 2023, 4:40 AM
zero9178 marked 11 inline comments as done.

Address review comments:

  • Style nits
  • Fix bug with loop header block argument being incorrectly replaced
zero9178 updated this revision to Diff 548153.Aug 8 2023, 4:41 AM

don't forget about clang-format

mlir/lib/Transforms/Utils/CFGToSCF.cpp
1005

I tried to move the ret block further up and moving the Region 1 label to have about equal distance to both in the hopes of it looking clearer

mlir/test/Conversion/ControlFlowToSCF/test.mlir
292

Good catch! This was actually a bug :) The transformToReduceLoop method incorrectly handled loop header block arguments, accidently causing uses of the loop header block arguments outside the loop to be replaced by its values at the end of the loop body. This should be fixed now.

gysit accepted this revision.Aug 8 2023, 11:12 PM

LGTM

Looking forward to use this!

mlir/lib/Transforms/Utils/CFGToSCF.cpp
1005

nice!

This revision is now accepted and ready to land.Aug 8 2023, 11:12 PM
This revision was automatically updated to reflect the committed changes.

I run this pass on our python testsuite and results looking good

46 failed, 10950 passed, 3466 skipped, 445 xfailed, 46 warnings in 252.77s (0:04:12)

Look like most of the failures come from scf.execute_region, but I didn't looked in detail yet.
Example of failing IR https://gist.github.com/Hardcode84/8e55de4009e190af59cd2e7c25ebdbfa

I run this pass on our python testsuite and results looking good

46 failed, 10950 passed, 3466 skipped, 445 xfailed, 46 warnings in 252.77s (0:04:12)

Look like most of the failures come from scf.execute_region, but I didn't looked in detail yet.
Example of failing IR https://gist.github.com/Hardcode84/8e55de4009e190af59cd2e7c25ebdbfa

Thank you so much for testing! Did you use the transformCFGToSCF method or the lift-cfg-to-scf pass? Neither of the two currently traverse into subregions of ops. Unless you run the former on the region of the scf.execute it likely didn't do any transformations.
I think transformCFGToSCF should only operate on one region at a time, but we could think about making the lift-cfg-to-scf pass traverse into subregions in post-order.

I used the pass https://github.com/numba/numba-mlir/pull/189
I'll try to hack recursive version on top of transformCFGToSCF and see what happens

6 failed, 10990 passed, 3466 skipped, 445 xfailed, 46 warnings in 260.37s (0:04:20)

Much better, the rest if the failures seems to come from the fact we are not handling switch op correctly and not related to the pass directly.

Also, ControlFlowToSCFTransformation from pass impl probably should also be exposed, copypasted the entire thing for now.

Hardcode84 added a comment.EditedAug 13 2023, 5:46 AM

Ok, after lowering IndexSwitch to sequence of scf.if's all tests have passed. I guess, we are dropping our custom CFGToSCF and switching to upstream :)

Thanks for this work.

Ok, after lowering IndexSwitch to sequence of scf.if's all tests have passed. I guess, we are dropping our custom CFGToSCF and switching to upstream :)

Thanks for this work.

Super happy to hear that, and thank you for the original implementation!

@zero9178 I can make patches for subregion walk and exposing ControlFlowToSCFTransformation later this week.