This is an archive of the discontinued LLVM Phabricator instance.

[mlir][Transforms] CSE of ops with a single block.
ClosedPublic

Authored by mravishankar on Sep 20 2022, 1:22 PM.

Details

Summary

Currently CSE does not support CSE of ops with regions. This patch
extends the CSE support to ops with a single region.

Diff Detail

Event Timeline

mravishankar created this revision.Sep 20 2022, 1:22 PM
Herald added a project: Restricted Project. · View Herald Transcript

@nicolasvasilache this is the draft I have for CSE-ing ops with a single block. Its probably more complex than it needs to be since I think a lot of the things related to hash computation can either be dropped, or simplified a lot.

mravishankar published this revision for review.Sep 23 2022, 5:51 PM
mravishankar retitled this revision from [WIP] CSE of ops with a single block. to [mlir][Transforms] CSE of ops with a single block..
mravishankar edited the summary of this revision. (Show Details)
Herald added a project: Restricted Project. · View Herald TranscriptSep 23 2022, 5:51 PM

@rriddle and @nicolasvasilache this is first attempt at adding CSE support for blocks with a single region and single block. There are some failures with this, but first want to get an idea if this is along the right direction.

Mogball added inline comments.Sep 26 2022, 12:43 PM
mlir/include/mlir/IR/OperationSupport.h
862 ↗(On Diff #462634)

I'm not sure operation equivalence needs to be modified to support this change, mainly because this doesn't support multiblock regions. You can hash the region separately from the operation and compare.

mlir/lib/Transforms/CSE.cpp
120

I'm surprised this works. Isn't this hashing the block pointer?

I'll take a more in depth look later.

Thanks @Mogball . I did need this, so happy to get input and adapt accordingly.

mlir/include/mlir/IR/OperationSupport.h
862 ↗(On Diff #462634)

I think you are right.... The change to computeHash itself can be dropped. Needs the changes to OperationEquivalence though....

mlir/lib/Transforms/CSE.cpp
120

Yes. AFAIK, its just a hash. Worst case has some collisions.

Mogball added inline comments.Sep 26 2022, 2:25 PM
mlir/lib/Transforms/CSE.cpp
120

What surprises me is that the test cases below work, which shouldn't happen if the hashes are different, but in this case they are because the block pointers are different for each region.

mravishankar added inline comments.Sep 26 2022, 6:10 PM
mlir/lib/Transforms/CSE.cpp
120

Thats was my starting assumption too. My starting understanding was

  1. It is the isEqual method that really comparese two operations. If that returns true for two operations that can be CSEed then thats enough.
  2. The hash is just a way to do fast compares. If two ops have the same hash, they "might" be same. So first you compare hashes, and if hash is the same you compare the op explicitly.

So in theory we can explicitly ignore the region hashes, and it should still work .

rriddle added inline comments.Sep 27 2022, 9:34 AM
mlir/include/mlir/IR/OperationSupport.h
863 ↗(On Diff #462634)

I would either uncomment, or just remove the variable name (it adds nothing here).

mlir/lib/Transforms/CSE.cpp
53–54

This path looks broken. Why is this no longer calling into OperationEquivalence? Please separate out the region code into a different function so that this one is a bit easier to follow.

120

The point of hashing is to simplify the number of collisions. The point @Mogball is making is that this hash will never match with any other really, given that no other region will have block inside of it. How are you matching region operations if the hash never matches?

mravishankar added inline comments.Sep 28 2022, 10:42 AM
mlir/lib/Transforms/CSE.cpp
120

I think I understand what you guys are saying. Would it work if I just ignore the region during the hash computation.

ftynse added a subscriber: ftynse.Sep 29 2022, 9:07 AM

Since we are looking at this and there is discussion of splitting, could we factor out some of the CSE logic into independent utility functions? I'd like to be able to call that for a single region without running a pass across everything, like we can do with various pattern rewriters.

Rebase and address some comments.

mravishankar added inline comments.Sep 29 2022, 10:16 AM
mlir/lib/Transforms/CSE.cpp
53–54

It was broke indeed. Fixing this fixed all the tests that were failing. THanks for catching it! Now all tests pass.

120

So I dropped any hashing that accounted for regions. Makes the code simpler. It might cause more collisions, but IIUC the approach here is
(a) If hash is the same use isEqual to actually disambiguate.
(b) if hashes are different ops are different.

What's going on with the diff?

mlir/test/lib/Dialect/Test/TestOps.td
3031

super nit: can you capitalize CSE here?

What's going on with the diff?

What specifically. The diff itself is fine. I made some changes, and phabricator doesnt know how to realign

I don't see any changes to CSE.cpp anymore

I don't see any changes to CSE.cpp anymore

Thats surprising. I see changes to CSE.cpp here.

Handle commutative ops correctly.

@Mogball do you still not see the diff for CSE.cpp?

I can see them now. I'm not sure what happened

Rebase and address nit.

mravishankar marked an inline comment as done.Oct 4 2022, 3:00 PM

@rriddle and @Mogball wondering if you guys have any more comments on this (or whether this is landable or not). The functionality is a blocker for other work, so I'd like to get an idea of how to proceed here (if this approach has issues). Happy to address/modify the approach based on your recommendations.

Apologies for the delay, we have a company event this week.

mlir/lib/IR/OperationSupport.cpp
723–745

Why do you need two SmallVector? Should be easy enough to order OpResults before block arguments inside of the sort function.

Either way, what's the benefit of ordering block arguments differently from results?

mlir/lib/Transforms/CSE.cpp
17

I would've expected that this would already be included.

52–53

The comment of if they have a single region with a single block seems like something that should be pushed down to where the context is relevant. Just saying "No regions take the easy path"(or something similar) seems fine here.

63
73
79
97–100

Why check getParent here instead of just checking areEquivalentValues?

106

When does this happen (aside from results of the parent op)?

mlir/test/Transforms/cse.mlir
328–334

Can you group the CHECK lines in these tests? It's a little hard for me to read given the size of the regions.

mravishankar marked 3 inline comments as done.

Rebase and address comments.

mravishankar marked 7 inline comments as done.Oct 5 2022, 11:28 AM
mravishankar added inline comments.
mlir/lib/IR/OperationSupport.cpp
723–745

I combined the two to have a single sort function. I kept it separate for clarity, but there is really no need to do that .

As to why block arguments ordering matters, that helps in the equivalence comparison in the call back. For example, if you have the following two regions

^bb0(%a0 : f32, %a1: f32) :
  %0 = arith.addf %a0, %a1 : f32
  linalg.yield %0 : f32

and

^bb0(%b0 : f32, %b1: f32) :
  %1 = arith.addf %b0, %b1 : f32
  linalg.yield %1 : f32

These two regions are equivalent. But if you use the pointer values to sort it, when comparing arith.addf operands you can end up with the operands of the arith.addf in the first region being [%a0, %a1] and for the arith.addf in the second region being [%b1, %b0] . Then the equivalence check fails. If you account for the argument position as well.
Writing this though it still doesnt cover all the cases. You could have

^bb0(%a0 : f32, %a1: f32) :

%0 = arith.addf %a0, %a1 : f32
%1 = arith.mulf %a0, %a1 : f32
%2 = arith.addf %0, %1 : f32
linalg.yield %2 : f32
and

^bb0(%b0 : f32, %b1: f32) :

%3 = arith.addf %b0, %b1 : f32
%4 = arith.mulf %b0, %b1 : f32
%5 = arith.addf %3, %4 : f32
linalg.yield %5 : f32
which would still fail cause `%3` and `%4` would be ordered using the `Value`s pointer value. Thats an issue (and really hard to write a test cases for it). Is there a way to completely avoid using pointer value of `Value`?
mlir/lib/Transforms/CSE.cpp
97–100

I think you are right. Dropped the getParent check...

106

You need to treat the values created within a region by equivalent operations as equivalent. For example,

^bb0(%arg0 : f32, %arg1: f32, %arg2 : f32) :
  %0 = arith.addf %arg0, %arg1 : f32
  %1 = arith.mulf %0, %arg2 : f32
  linalg.yield %1 : f32

and

^bb0(%arg0 : f32, %arg1: f32, %arg2 : f32) :
  %2 = arith.addf %arg0, %arg1 : f32
  %3 = arith.mulf %2, %arg2 : f32
  linalg.yield %3 : f32

%0 and %2 are equivalent and based on that %3 and %1 are equivalent. Then the regions are the same and the ops can be CSEed (if the other conditions like operands are the same attributes are the same, etc. are met).

So getParent here checks that this is an op within the region of the two operations being compared, and they are then added to the equivalence set.

mlir/test/Transforms/cse.mlir
328–334

I was just being consistent with the file here. I prefer having the CHECK*s together too. Changed.

mravishankar marked 3 inline comments as done.

Drop unnecessary header.

mravishankar marked an inline comment as done.Oct 5 2022, 11:31 AM
mravishankar added inline comments.
mlir/lib/Transforms/CSE.cpp
17

Dont know right now why I needed it (I think it was to get the mlir::hash_value(Value v) definition. Dont need it now. Dropped

Mogball added inline comments.Oct 10 2022, 9:10 AM
mlir/lib/Transforms/CSE.cpp
80

auto [lhsArg, rhsArg] = ?

265

This comment needs an update

Mogball accepted this revision.Oct 10 2022, 9:11 AM

LG overall

mlir/lib/Transforms/CSE.cpp
86

This should really be added to Value. I've written this same code many times. Thoughts @rriddle ?

This revision is now accepted and ready to land.Oct 10 2022, 9:11 AM
mravishankar marked an inline comment as done.

Add an option to enable CSE of single block ops. This seems to have
pretty significant downstream effects. Making this enhancement opt-in
as a ramp to make it default available.

Fix comment.

I would prefer the option be true by default. If the pass is creating erroneous code, it's either a bug in the pass (which I assume isn't the case here), or something wrong with downstream users. It should always be legal to CSE an side-effect-free op.

mravishankar marked 3 inline comments as done.Oct 19 2022, 4:51 PM

@Mogball I made the CSE for single block ops guarded by a flag. I used it downstream, and it has some non-trivial effects. It is needed for some things, and this is probably an indication that we need a mechanism that allows conditional CSE (on a set of ops, or class of ops, etc.). For now the flag keeps the "new feature" guarded to use as needed. PTAL since this is a change to the patch from when you approved it.

rriddle requested changes to this revision.Oct 19 2022, 4:52 PM

@Mogball I made the CSE for single block ops guarded by a flag. I used it downstream, and it has some non-trivial effects. It is needed for some things, and this is probably an indication that we need a mechanism that allows conditional CSE (on a set of ops, or class of ops, etc.). For now the flag keeps the "new feature" guarded to use as needed. PTAL since this is a change to the patch from when you approved it.

What are the non-trivial effects? I'm quite concerned about guarding things behind a flag.

This revision now requires changes to proceed.Oct 19 2022, 4:52 PM
What are the non-trivial effects? I'm quite concerned about guarding things behind a flag.

Basically CSE increases lifetimes of operations. So actually in the long term we probably need CSE method to be a more controllable pass than it is today. The fact is that this is an issue today already. You could have an mhlo.dot which is a GEMM operation that will get CSE-ed, but when lowered to linalg.matmul it doesnt get CSEed, while the impact of CSE-ing these operations are the same....
It also has impact on computation that is in destination passing style. This is the issue I hit.

%0 = tensor.empty [%d0, %d1] : tensor<?x?xf32>
%1 = linalg.fill ins(%cst) outs(%0) : tensor<?x?xf32>
%2 = linalg.matmul ins(...) outs(%1) : tensor<?x?xf32>
%3 = tensor.empty [%d0, %d1] : tensor<?x?xf32>
%4 = linalg.fill ins(%cst) outs(%3) : tensor<?x?xf32>
%5 = linalg.matmul ins(...) outs(%4) : tensor<?x?xf32>

Earlier the linalg.fill would not CSE (because actually its an op with a single block). Now it will

%0 = tensor.empty [%d0, %d1] : tensor<?x?xf32>
%1 = linalg.fill ins(%cst) outs(%0) : tensor<?x?xf32>
%2 = linalg.matmul ins(...) outs(%1) : tensor<?x?xf32>
%5 = linalg.matmul ins(...) outs(%1) : tensor<?x?xf32>

While the semantics is correct, this is a more opinionated change than "just CSE". This is effectively adding false sharing of the output when one didnt exist earlier. So bufferization (or some pre-processing before bufferization) will have to break this false dependency. All these could have happened earlier as well with CSE. It is just exposed now because the linalg operations have a single block, and those just happened to not be covered by CSE. So what this change did is unearth an issue that already existed, which needs to be fixed by allowing CSE to be more controllable....
If having a flag is an issue, then please suggest a way forward. This is blocking quite a lot of things for me.....

FYI @ftynse @nicolasvasilache

What are the non-trivial effects? I'm quite concerned about guarding things behind a flag.

Basically CSE increases lifetimes of operations. So actually in the long term we probably need CSE method to be a more controllable pass than it is today. The fact is that this is an issue today already. You could have an mhlo.dot which is a GEMM operation that will get CSE-ed, but when lowered to linalg.matmul it doesnt get CSEed, while the impact of CSE-ing these operations are the same....
It also has impact on computation that is in destination passing style. This is the issue I hit.

%0 = tensor.empty [%d0, %d1] : tensor<?x?xf32>
%1 = linalg.fill ins(%cst) outs(%0) : tensor<?x?xf32>
%2 = linalg.matmul ins(...) outs(%1) : tensor<?x?xf32>
%3 = tensor.empty [%d0, %d1] : tensor<?x?xf32>
%4 = linalg.fill ins(%cst) outs(%3) : tensor<?x?xf32>
%5 = linalg.matmul ins(...) outs(%4) : tensor<?x?xf32>

Earlier the linalg.fill would not CSE (because actually its an op with a single block). Now it will

%0 = tensor.empty [%d0, %d1] : tensor<?x?xf32>
%1 = linalg.fill ins(%cst) outs(%0) : tensor<?x?xf32>
%2 = linalg.matmul ins(...) outs(%1) : tensor<?x?xf32>
%5 = linalg.matmul ins(...) outs(%1) : tensor<?x?xf32>

While the semantics is correct, this is a more opinionated change than "just CSE". This is effectively adding false sharing of the output when one didnt exist earlier. So bufferization (or some pre-processing before bufferization) will have to break this false dependency. All these could have happened earlier as well with CSE. It is just exposed now because the linalg operations have a single block, and those just happened to not be covered by CSE. So what this change did is unearth an issue that already existed, which needs to be fixed by allowing CSE to be more controllable....

I really don't see the problem here, this is precisely what I would expect.
The fact that downstream assumptions are not valid anymore should not be a blocker for this to progress.

If having a flag is an issue, then please suggest a way forward. This is blocking quite a lot of things for me.....

FYI @ftynse @nicolasvasilache

It seems like you want downstream-specific behavior to be preserved, this suggests downstream-specific changes (either a custom CSE but ideally don't rely on such assumptions).

I would drop the flag from this PR.

Regardless of the discussion on Linalg vs. MHLO CSE behavior, which is arguably up to the dialects to decide, I think we should just (a) just enable this change without a flag as it looks reasonable to have and (b) consider finer-grain control over general simplification (canonicalization, CSE, constant folding and propagation) in the longer term.

Drop the optionality flag...

@rriddle dropped the flag. PTAL.

nicolasvasilache accepted this revision.Oct 28 2022, 2:57 AM

Thanks Mahesh!

I am fine with this as it stands now.

rriddle accepted this revision.Nov 1 2022, 10:44 PM
rriddle added inline comments.
mlir/lib/Transforms/CSE.cpp
265

Not done yet?

This revision is now accepted and ready to land.Nov 1 2022, 10:44 PM

Sorry for the delay, swamped lately. The flang build breakages look real though.

Rebase and update.

Rebase and add dependency on D137857

Fix failing lit test.

This revision was landed with ongoing or failed builds.Nov 15 2022, 6:56 PM
This revision was automatically updated to reflect the committed changes.
aartbik added inline comments.Nov 15 2022, 9:06 PM
mlir/test/Dialect/SparseTensor/codegen_buffer_initialization.mlir
20 ↗(On Diff #475659)

LGTM, but we need a FIXME for the windows determinism