Page MenuHomePhabricator

[mlir][Transforms] CSE of ops with a single block.
ClosedPublic

Authored by mravishankar on Sep 20 2022, 1:22 PM.

Details

Summary

Currently CSE does not support CSE of ops with regions. This patch
extends the CSE support to ops with a single region.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Rebase and address some comments.

mravishankar added inline comments.Sep 29 2022, 10:16 AM
mlir/lib/Transforms/CSE.cpp
52–53

It was broke indeed. Fixing this fixed all the tests that were failing. THanks for catching it! Now all tests pass.

119

So I dropped any hashing that accounted for regions. Makes the code simpler. It might cause more collisions, but IIUC the approach here is
(a) If hash is the same use isEqual to actually disambiguate.
(b) if hashes are different ops are different.

What's going on with the diff?

mlir/test/lib/Dialect/Test/TestOps.td
2992

super nit: can you capitalize CSE here?

What's going on with the diff?

What specifically. The diff itself is fine. I made some changes, and phabricator doesnt know how to realign

I don't see any changes to CSE.cpp anymore

I don't see any changes to CSE.cpp anymore

Thats surprising. I see changes to CSE.cpp here.

Handle commutative ops correctly.

@Mogball do you still not see the diff for CSE.cpp?

I can see them now. I'm not sure what happened

Rebase and address nit.

mravishankar marked an inline comment as done.Oct 4 2022, 3:00 PM

@rriddle and @Mogball wondering if you guys have any more comments on this (or whether this is landable or not). The functionality is a blocker for other work, so I'd like to get an idea of how to proceed here (if this approach has issues). Happy to address/modify the approach based on your recommendations.

Apologies for the delay, we have a company event this week.

mlir/lib/IR/OperationSupport.cpp
724–746

Why do you need two SmallVector? Should be easy enough to order OpResults before block arguments inside of the sort function.

Either way, what's the benefit of ordering block arguments differently from results?

mlir/lib/Transforms/CSE.cpp
17

I would've expected that this would already be included.

51–52

The comment of if they have a single region with a single block seems like something that should be pushed down to where the context is relevant. Just saying "No regions take the easy path"(or something similar) seems fine here.

62
72
78
96–99

Why check getParent here instead of just checking areEquivalentValues?

105

When does this happen (aside from results of the parent op)?

mlir/test/Transforms/cse.mlir
328–334

Can you group the CHECK lines in these tests? It's a little hard for me to read given the size of the regions.

mravishankar marked 3 inline comments as done.

Rebase and address comments.

mravishankar marked 7 inline comments as done.Oct 5 2022, 11:28 AM
mravishankar added inline comments.
mlir/lib/IR/OperationSupport.cpp
724–746

I combined the two to have a single sort function. I kept it separate for clarity, but there is really no need to do that .

As to why block arguments ordering matters, that helps in the equivalence comparison in the call back. For example, if you have the following two regions

^bb0(%a0 : f32, %a1: f32) :
  %0 = arith.addf %a0, %a1 : f32
  linalg.yield %0 : f32

and

^bb0(%b0 : f32, %b1: f32) :
  %1 = arith.addf %b0, %b1 : f32
  linalg.yield %1 : f32

These two regions are equivalent. But if you use the pointer values to sort it, when comparing arith.addf operands you can end up with the operands of the arith.addf in the first region being [%a0, %a1] and for the arith.addf in the second region being [%b1, %b0] . Then the equivalence check fails. If you account for the argument position as well.
Writing this though it still doesnt cover all the cases. You could have

^bb0(%a0 : f32, %a1: f32) :

%0 = arith.addf %a0, %a1 : f32
%1 = arith.mulf %a0, %a1 : f32
%2 = arith.addf %0, %1 : f32
linalg.yield %2 : f32
and

^bb0(%b0 : f32, %b1: f32) :

%3 = arith.addf %b0, %b1 : f32
%4 = arith.mulf %b0, %b1 : f32
%5 = arith.addf %3, %4 : f32
linalg.yield %5 : f32
which would still fail cause `%3` and `%4` would be ordered using the `Value`s pointer value. Thats an issue (and really hard to write a test cases for it). Is there a way to completely avoid using pointer value of `Value`?
mlir/lib/Transforms/CSE.cpp
96–99

I think you are right. Dropped the getParent check...

105

You need to treat the values created within a region by equivalent operations as equivalent. For example,

^bb0(%arg0 : f32, %arg1: f32, %arg2 : f32) :
  %0 = arith.addf %arg0, %arg1 : f32
  %1 = arith.mulf %0, %arg2 : f32
  linalg.yield %1 : f32

and

^bb0(%arg0 : f32, %arg1: f32, %arg2 : f32) :
  %2 = arith.addf %arg0, %arg1 : f32
  %3 = arith.mulf %2, %arg2 : f32
  linalg.yield %3 : f32

%0 and %2 are equivalent and based on that %3 and %1 are equivalent. Then the regions are the same and the ops can be CSEed (if the other conditions like operands are the same attributes are the same, etc. are met).

So getParent here checks that this is an op within the region of the two operations being compared, and they are then added to the equivalence set.

mlir/test/Transforms/cse.mlir
328–334

I was just being consistent with the file here. I prefer having the CHECK*s together too. Changed.

mravishankar marked 3 inline comments as done.

Drop unnecessary header.

mravishankar marked an inline comment as done.Oct 5 2022, 11:31 AM
mravishankar added inline comments.
mlir/lib/Transforms/CSE.cpp
17

Dont know right now why I needed it (I think it was to get the mlir::hash_value(Value v) definition. Dont need it now. Dropped

Mogball added inline comments.Oct 10 2022, 9:10 AM
mlir/lib/Transforms/CSE.cpp
79

auto [lhsArg, rhsArg] = ?

264

This comment needs an update

Mogball accepted this revision.Oct 10 2022, 9:11 AM

LG overall

mlir/lib/Transforms/CSE.cpp
85

This should really be added to Value. I've written this same code many times. Thoughts @rriddle ?

This revision is now accepted and ready to land.Oct 10 2022, 9:11 AM
mravishankar marked an inline comment as done.

Add an option to enable CSE of single block ops. This seems to have
pretty significant downstream effects. Making this enhancement opt-in
as a ramp to make it default available.

Fix comment.

I would prefer the option be true by default. If the pass is creating erroneous code, it's either a bug in the pass (which I assume isn't the case here), or something wrong with downstream users. It should always be legal to CSE an side-effect-free op.

mravishankar marked 3 inline comments as done.Oct 19 2022, 4:51 PM

@Mogball I made the CSE for single block ops guarded by a flag. I used it downstream, and it has some non-trivial effects. It is needed for some things, and this is probably an indication that we need a mechanism that allows conditional CSE (on a set of ops, or class of ops, etc.). For now the flag keeps the "new feature" guarded to use as needed. PTAL since this is a change to the patch from when you approved it.

rriddle requested changes to this revision.Oct 19 2022, 4:52 PM

@Mogball I made the CSE for single block ops guarded by a flag. I used it downstream, and it has some non-trivial effects. It is needed for some things, and this is probably an indication that we need a mechanism that allows conditional CSE (on a set of ops, or class of ops, etc.). For now the flag keeps the "new feature" guarded to use as needed. PTAL since this is a change to the patch from when you approved it.

What are the non-trivial effects? I'm quite concerned about guarding things behind a flag.

This revision now requires changes to proceed.Oct 19 2022, 4:52 PM
What are the non-trivial effects? I'm quite concerned about guarding things behind a flag.

Basically CSE increases lifetimes of operations. So actually in the long term we probably need CSE method to be a more controllable pass than it is today. The fact is that this is an issue today already. You could have an mhlo.dot which is a GEMM operation that will get CSE-ed, but when lowered to linalg.matmul it doesnt get CSEed, while the impact of CSE-ing these operations are the same....
It also has impact on computation that is in destination passing style. This is the issue I hit.

%0 = tensor.empty [%d0, %d1] : tensor<?x?xf32>
%1 = linalg.fill ins(%cst) outs(%0) : tensor<?x?xf32>
%2 = linalg.matmul ins(...) outs(%1) : tensor<?x?xf32>
%3 = tensor.empty [%d0, %d1] : tensor<?x?xf32>
%4 = linalg.fill ins(%cst) outs(%3) : tensor<?x?xf32>
%5 = linalg.matmul ins(...) outs(%4) : tensor<?x?xf32>

Earlier the linalg.fill would not CSE (because actually its an op with a single block). Now it will

%0 = tensor.empty [%d0, %d1] : tensor<?x?xf32>
%1 = linalg.fill ins(%cst) outs(%0) : tensor<?x?xf32>
%2 = linalg.matmul ins(...) outs(%1) : tensor<?x?xf32>
%5 = linalg.matmul ins(...) outs(%1) : tensor<?x?xf32>

While the semantics is correct, this is a more opinionated change than "just CSE". This is effectively adding false sharing of the output when one didnt exist earlier. So bufferization (or some pre-processing before bufferization) will have to break this false dependency. All these could have happened earlier as well with CSE. It is just exposed now because the linalg operations have a single block, and those just happened to not be covered by CSE. So what this change did is unearth an issue that already existed, which needs to be fixed by allowing CSE to be more controllable....
If having a flag is an issue, then please suggest a way forward. This is blocking quite a lot of things for me.....

FYI @ftynse @nicolasvasilache

What are the non-trivial effects? I'm quite concerned about guarding things behind a flag.

Basically CSE increases lifetimes of operations. So actually in the long term we probably need CSE method to be a more controllable pass than it is today. The fact is that this is an issue today already. You could have an mhlo.dot which is a GEMM operation that will get CSE-ed, but when lowered to linalg.matmul it doesnt get CSEed, while the impact of CSE-ing these operations are the same....
It also has impact on computation that is in destination passing style. This is the issue I hit.

%0 = tensor.empty [%d0, %d1] : tensor<?x?xf32>
%1 = linalg.fill ins(%cst) outs(%0) : tensor<?x?xf32>
%2 = linalg.matmul ins(...) outs(%1) : tensor<?x?xf32>
%3 = tensor.empty [%d0, %d1] : tensor<?x?xf32>
%4 = linalg.fill ins(%cst) outs(%3) : tensor<?x?xf32>
%5 = linalg.matmul ins(...) outs(%4) : tensor<?x?xf32>

Earlier the linalg.fill would not CSE (because actually its an op with a single block). Now it will

%0 = tensor.empty [%d0, %d1] : tensor<?x?xf32>
%1 = linalg.fill ins(%cst) outs(%0) : tensor<?x?xf32>
%2 = linalg.matmul ins(...) outs(%1) : tensor<?x?xf32>
%5 = linalg.matmul ins(...) outs(%1) : tensor<?x?xf32>

While the semantics is correct, this is a more opinionated change than "just CSE". This is effectively adding false sharing of the output when one didnt exist earlier. So bufferization (or some pre-processing before bufferization) will have to break this false dependency. All these could have happened earlier as well with CSE. It is just exposed now because the linalg operations have a single block, and those just happened to not be covered by CSE. So what this change did is unearth an issue that already existed, which needs to be fixed by allowing CSE to be more controllable....

I really don't see the problem here, this is precisely what I would expect.
The fact that downstream assumptions are not valid anymore should not be a blocker for this to progress.

If having a flag is an issue, then please suggest a way forward. This is blocking quite a lot of things for me.....

FYI @ftynse @nicolasvasilache

It seems like you want downstream-specific behavior to be preserved, this suggests downstream-specific changes (either a custom CSE but ideally don't rely on such assumptions).

I would drop the flag from this PR.

Regardless of the discussion on Linalg vs. MHLO CSE behavior, which is arguably up to the dialects to decide, I think we should just (a) just enable this change without a flag as it looks reasonable to have and (b) consider finer-grain control over general simplification (canonicalization, CSE, constant folding and propagation) in the longer term.

Drop the optionality flag...

@rriddle dropped the flag. PTAL.

nicolasvasilache accepted this revision.Oct 28 2022, 2:57 AM

Thanks Mahesh!

I am fine with this as it stands now.

rriddle accepted this revision.Nov 1 2022, 10:44 PM
rriddle added inline comments.
mlir/lib/Transforms/CSE.cpp
264

Not done yet?

This revision is now accepted and ready to land.Nov 1 2022, 10:44 PM

Sorry for the delay, swamped lately. The flang build breakages look real though.

Rebase and update.

Rebase and add dependency on D137857

Fix failing lit test.

This revision was landed with ongoing or failed builds.Tue, Nov 15, 6:56 PM
This revision was automatically updated to reflect the committed changes.
aartbik added inline comments.Tue, Nov 15, 9:06 PM
mlir/test/Dialect/SparseTensor/codegen_buffer_initialization.mlir
20

LGTM, but we need a FIXME for the windows determinism