This is an archive of the discontinued LLVM Phabricator instance.

[mlir][SCF] CSE and hoist operations in scf.if branches
Needs ReviewPublic

Authored by springerm on Jan 24 2023, 8:29 AM.

Details

Summary

-cse does not eliminate common sub-expressions that appear on both branches of an scf.if op. Such IR is sometimes produced by the bufferization (e.g., duplicate memref.subview ops). CSE'ing can enable additional optimizations such as the removal of self-copies.

This revisions adds a new pass to CSE ops within scf.if branches.

Depends On: D143253

Diff Detail

Event Timeline

springerm created this revision.Jan 24 2023, 8:29 AM
springerm requested review of this revision.Jan 24 2023, 8:29 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 24 2023, 8:29 AM

This is awesome! Out of curiosity, is the intent for this to be run as part of a fixed-point iteration in order to handle nested scf.ifs?

This is awesome! Out of curiosity, is the intent for this to be run as part of a fixed-point iteration in order to handle nested scf.ifs?

void SCFCSEBranches::runOnOperation() {
  getOperation()->walk([](IfOp ifOp) {

This walk is post-order, so a single run of the pass should be sufficient to handle nested scf.ifs. But it may be beneficial to run this pass multiple times interleaved with the normal -cse pass because this pass only hoists+CSE's direct children ops of scf.if.

mravishankar requested changes to this revision.Jan 24 2023, 9:25 AM

Could we just use the hoisting part here. You can just hoist operations outside of if and let standard CSE common out the users.... THis is duplicating a lot of functionality

This revision now requires changes to proceed.Jan 24 2023, 9:25 AM

Could we just use the hoisting part here. You can just hoist operations outside of if and let standard CSE common out the users.... THis is duplicating a lot of functionality

We could but this allows us to CSE irrespective of op spectuability and it has no effect on performance (because CSE’d ops are executed in either branch).

I’m going to address the TODO tmr and reuse isEqual() from the other CSE.cpp, then there’s no code duplication anymore.

mravishankar added a comment.EditedJan 24 2023, 10:03 AM

Could we just use the hoisting part here. You can just hoist operations outside of if and let standard CSE common out the users.... THis is duplicating a lot of functionality

We could but this allows us to CSE irrespective of op spectuability and it has no effect on performance (because CSE’d ops are executed in either branch).

You are CSE-ing and hoisting... thats already within the preview of treating it as a speculatable op. Doing this on non-speculatable ops would be illegal AFAIK.

I’m going to address the TODO tmr and reuse isEqual() from the other CSE.cpp, then there’s no code duplication anymore.

You are CSE-ing and hoisting... thats already within the preview of treating it as a speculatable op. Doing this on non-speculatable ops would be illegal AFAIK.

That’s why this pass is specific to scf.if. It does not hoist or CSE non-specutable ops from other ops. That should be legal.

You are CSE-ing and hoisting... thats already within the preview of treating it as a speculatable op. Doing this on non-speculatable ops would be illegal AFAIK.

That’s why this pass is specific to scf.if. It does not hoist or CSE non-specutable ops from other ops. That should be legal.

Then I am confused. If it is speculatable only, then you can just hoist.. Then CSE will handle the equivalence and replacing uses.... Not sure what I am missing....

springerm added a comment.EditedJan 25 2023, 12:37 AM

Then I am confused. If it is speculatable only, then you can just hoist.. Then CSE will handle the equivalence and replacing uses.... Not sure what I am missing....

What I meant:

%r = scf.if ... {
  "non_specutable_op"()
  ...
} else {
  "non_specutable_op"()
  ...
}

==> /* This transformation does not affect performance. */

"non_specutable_op"()
%r = scf.if ... {
  ...
} else {
  ...
}

Hoisting and CSE'ing "non_spectuable_op" from both branches is OK. Any op for that matter. Just hoisting from the "then" branch is not OK; that's only allowed if the op is pure and specutable.

My main concerns about "just hoisting" are performance implications. Should we hoist a tensor.insert_slice from an scf.if? That's a buffer copy after bufferization, so it can lead to a performance regression. We could go even further: Hoist the entire "then" and "else" bodies (assuming they are all pure+specutable ops; usually the case in tensor IR), so that only the terminators are left over; scf.if will become an arith.select. That's probably not desirable.

So we would need to define a set of "hoistable" ops: ops that we can hoisted without causing UB (specutable) or performance regressions. The latter condition is currently not modeled in MLIR. Sure, I could write a pattern to just hoist memref.subview for now, but we will probably need to hoist additional ops soon (e.g., memref.cast) and I'm thinking of a more general solution here.

springerm updated this revision to Diff 492016.Jan 25 2023, 1:02 AM

Share operation equivalence check with CSE.cpp

rriddle requested changes to this revision.Jan 25 2023, 1:08 AM

This change to CSE doesn't seem that ideal to me in general, it's leaking the internal details of what CSE is looking for -- which may change in the future, in which case we'd have to propagate that to the hoisting pass being added here.

This revision now requires changes to proceed.Jan 25 2023, 1:08 AM

This change to CSE doesn't seem that ideal to me in general, it's leaking the internal details of what CSE is looking for -- which may change in the future, in which case we'd have to propagate that to the hoisting pass being added here.

How about about extending OperationEquivalence in OperationSupport.h so that it also takes into account the contents of nested regions? (E.g., adding a new mlir::isEquivalent(Operation *, Operation *) helper to that file. It would have to be a bit smarter than the check in this file; i.e., support >1 regions/blocks.) Then we don't need a custom isEqual check in either CSE pass.

springerm added a comment.EditedJan 25 2023, 3:24 AM

This change to CSE doesn't seem that ideal to me in general, it's leaking the internal details of what CSE is looking for -- which may change in the future, in which case we'd have to propagate that to the hoisting pass being added here.

How about about extending OperationEquivalence in OperationSupport.h so that it also takes into account the contents of nested regions? (E.g., adding a new mlir::isEquivalent(Operation *, Operation *) helper to that file. It would have to be a bit smarter than the check in this file; i.e., support >1 regions/blocks.) Then we don't need a custom isEqual check in either CSE pass.

OperationEquivalent::isEquivalentTo already checks regions, but corresponding bbArgs of two ops are not considered "equivalent" by default (unless they are added to mapOperands). E.g.:

%0 = tensor.generate %size {
  ^bb0(%arg0: index):
  ...
} : tensor<?xf32>

%1 = tensor.generate %size {
  ^bb0(%arg1: index):
  ...
} : tensor<?xf32>

%arg0 and %arg1 are not equivalent by default. They must be marked as such in mapOperands. That's why OperationEquivalent::isEquivalentTo cannot be directly used in CSE.

springerm updated this revision to Diff 492139.Jan 25 2023, 8:55 AM

address comments

mravishankar added a comment.EditedJan 25 2023, 10:07 PM

Then I am confused. If it is speculatable only, then you can just hoist.. Then CSE will handle the equivalence and replacing uses.... Not sure what I am missing....

What I meant:

%r = scf.if ... {
  "non_specutable_op"()
  ...
} else {
  "non_specutable_op"()
  ...
}

==> /* This transformation does not affect performance. */

"non_specutable_op"()
%r = scf.if ... {
  ...
} else {
  ...
}

Hoisting and CSE'ing "non_spectuable_op" from both branches is OK. Any op for that matter. Just hoisting from the "then" branch is not OK; that's only allowed if the op is pure and specutable.

My main concerns about "just hoisting" are performance implications. Should we hoist a tensor.insert_slice from an scf.if? That's a buffer copy after bufferization, so it can lead to a performance regression. We could go even further: Hoist the entire "then" and "else" bodies (assuming they are all pure+specutable ops; usually the case in tensor IR), so that only the terminators are left over; scf.if will become an arith.select. That's probably not desirable.

That seems to be mixing concerns here. If a tensor.insert_slice can be hoisted, bufferization should be immune to that (I'd think it would even be better, but dont know the example that you have in mind). Hoisting entire then and else branches and ending up with a select seems to be a good thing in general.

So we would need to define a set of "hoistable" ops: ops that we can hoisted without causing UB (specutable) or performance regressions. The latter condition is currently not modeled in MLIR. Sure, I could write a pattern to just hoist memref.subview for now, but we will probably need to hoist additional ops soon (e.g., memref.cast) and I'm thinking of a more general solution here.

I dont fully understand this. Hoisting should always be better since the dependence chain you track doesnt need to go through conditionals. So I am not able to see the regression aspect.... Hoisting can also reduce life times of values...

Overall, I'd like some more input on these. Maybe @rriddle since he is already looking into it... Overall, I dont have very strong concerns, but seems like a specific solution when it doesnt need to be.

If a tensor.insert_slice can be hoisted, bufferization should be immune to that (I'd think it would even be better, but dont know the example that you have in mind). Hoisting entire then and else branches and ending up with a select seems to be a good thing in general.

Yes it is actually better for the bufferization — a flat structure is easier to bufferize than one with nested blocks; fewer things can go wrong.

But hoisting things also means that ops that were previously inside a branch and executed 50% of the time are now executed 100% of the time. (After bufferization, pure tensor ops have a side effect and therefore a cost.)

If a tensor.insert_slice can be hoisted, bufferization should be immune to that (I'd think it would even be better, but dont know the example that you have in mind). Hoisting entire then and else branches and ending up with a select seems to be a good thing in general.

Yes it is actually better for the bufferization — a flat structure is easier to bufferize than one with nested blocks; fewer things can go wrong.

But hoisting things also means that ops that were previously inside a branch and executed 50% of the time are now executed 100% of the time. (After bufferization, pure tensor ops have a side effect and therefore a cost.)

IMO that is just how tensor-based codegeneration works.... You cannot rely on that not being hoisted, unless there is some other mechanism to stop it from hoisting (either move it into a region that is not isolated from above, or maybe we need a way to say "dont move any op outside of this region"). But like I said, I can live with what you are doing here... but would really hope someone else can weigh in if they have a more informed opinion.

@rriddle This is no longer leaking internal details CSE. Does this look good to you?

Hardcode84 added inline comments.
mlir/lib/Dialect/SCF/Transforms/CSE.cpp
35

I think first check ifOp.getThenRegion().hasOneBlock() is useless and will always succeed.

44

I think llvm::make_early_inc_range(*thenBlock) will work.

Hardcode84 added inline comments.Jan 31 2023, 8:36 AM
mlir/lib/Dialect/SCF/Transforms/CSE.cpp
59

This algorithm is quadratic as it compares each op from then block with each op from else block. Is this ok for MLIR? It probably can be optimized by putting ops from then block into hashmap first and using it while traversing else block.

springerm updated this revision to Diff 494563.Feb 3 2023, 2:24 AM
springerm edited the summary of this revision. (Show Details)

address comments

springerm edited the summary of this revision. (Show Details)Feb 3 2023, 2:24 AM
springerm marked 2 inline comments as done.
springerm added inline comments.
mlir/lib/Dialect/SCF/Transforms/CSE.cpp
35

I wasn't sure because the "else" block used to allow multiple blocks. I think this was an oversight, addressed in D143253.

Hardcode84 added inline comments.Feb 3 2023, 3:16 AM
mlir/lib/Dialect/SCF/Transforms/CSE.cpp
35

Check for else region is still needed, but it can be !elseRegion.empty()or IfOp.getNumResults() != 0, also pls add test for scf.if without else just to check we don't crash on it.

springerm updated this revision to Diff 494582.Feb 3 2023, 3:24 AM
springerm marked an inline comment as done.

address comments

springerm marked an inline comment as done.Feb 3 2023, 3:25 AM

Couldn't CSE be extended to support the BranchOpInferface and work on scf.if the same way it works on CFG?

mlir/include/mlir/Dialect/SCF/Transforms/Passes.h
26

Typo: eliminates

springerm added a comment.EditedFeb 4 2023, 6:38 AM

Couldn't CSE be extended to support the BranchOpInferface and work on scf.if the same way it works on CFG?

This CSE variant is a bit different from Transforms/CSE.cpp in a sense that even ops with side effects can be CSE'd if they appear on both branches. So far so good, it should be possible to do this as part of CSE::simplifyOperation.

However, I think the RegionBranchOpInterface is not powerful enough. From the documentation of getSuccessorRegions:

These are the regions that may be selected during the flow of control.

The problem here is "may be". By querying this interface, we cannot deduce that either the "then" branch or the "else" branch will be executed. getSuccessorRegions will return both regions, but it is possible that none of the regions is executed.