Currently CSE does not support CSE of ops with regions. This patch
extends the CSE support to ops with a single region.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
@nicolasvasilache this is the draft I have for CSE-ing ops with a single block. Its probably more complex than it needs to be since I think a lot of the things related to hash computation can either be dropped, or simplified a lot.
@rriddle and @nicolasvasilache this is first attempt at adding CSE support for blocks with a single region and single block. There are some failures with this, but first want to get an idea if this is along the right direction.
mlir/include/mlir/IR/OperationSupport.h | ||
---|---|---|
862 ↗ | (On Diff #462634) | I'm not sure operation equivalence needs to be modified to support this change, mainly because this doesn't support multiblock regions. You can hash the region separately from the operation and compare. |
mlir/lib/Transforms/CSE.cpp | ||
119 | I'm surprised this works. Isn't this hashing the block pointer? |
Thanks @Mogball . I did need this, so happy to get input and adapt accordingly.
mlir/include/mlir/IR/OperationSupport.h | ||
---|---|---|
862 ↗ | (On Diff #462634) | I think you are right.... The change to computeHash itself can be dropped. Needs the changes to OperationEquivalence though.... |
mlir/lib/Transforms/CSE.cpp | ||
119 | Yes. AFAIK, its just a hash. Worst case has some collisions. |
mlir/lib/Transforms/CSE.cpp | ||
---|---|---|
119 | What surprises me is that the test cases below work, which shouldn't happen if the hashes are different, but in this case they are because the block pointers are different for each region. |
mlir/lib/Transforms/CSE.cpp | ||
---|---|---|
119 | Thats was my starting assumption too. My starting understanding was
So in theory we can explicitly ignore the region hashes, and it should still work . |
mlir/include/mlir/IR/OperationSupport.h | ||
---|---|---|
863 ↗ | (On Diff #462634) | I would either uncomment, or just remove the variable name (it adds nothing here). |
mlir/lib/Transforms/CSE.cpp | ||
52–53 | This path looks broken. Why is this no longer calling into OperationEquivalence? Please separate out the region code into a different function so that this one is a bit easier to follow. | |
119 | The point of hashing is to simplify the number of collisions. The point @Mogball is making is that this hash will never match with any other really, given that no other region will have block inside of it. How are you matching region operations if the hash never matches? |
mlir/lib/Transforms/CSE.cpp | ||
---|---|---|
119 | I think I understand what you guys are saying. Would it work if I just ignore the region during the hash computation. |
Since we are looking at this and there is discussion of splitting, could we factor out some of the CSE logic into independent utility functions? I'd like to be able to call that for a single region without running a pass across everything, like we can do with various pattern rewriters.
mlir/lib/Transforms/CSE.cpp | ||
---|---|---|
52–53 | It was broke indeed. Fixing this fixed all the tests that were failing. THanks for catching it! Now all tests pass. | |
119 | So I dropped any hashing that accounted for regions. Makes the code simpler. It might cause more collisions, but IIUC the approach here is |
What's going on with the diff?
mlir/test/lib/Dialect/Test/TestOps.td | ||
---|---|---|
2992 | super nit: can you capitalize CSE here? |
What specifically. The diff itself is fine. I made some changes, and phabricator doesnt know how to realign
@rriddle and @Mogball wondering if you guys have any more comments on this (or whether this is landable or not). The functionality is a blocker for other work, so I'd like to get an idea of how to proceed here (if this approach has issues). Happy to address/modify the approach based on your recommendations.
Apologies for the delay, we have a company event this week.
mlir/lib/IR/OperationSupport.cpp | ||
---|---|---|
724–746 | Why do you need two SmallVector? Should be easy enough to order OpResults before block arguments inside of the sort function. Either way, what's the benefit of ordering block arguments differently from results? | |
mlir/lib/Transforms/CSE.cpp | ||
17 | I would've expected that this would already be included. | |
51–52 | The comment of if they have a single region with a single block seems like something that should be pushed down to where the context is relevant. Just saying "No regions take the easy path"(or something similar) seems fine here. | |
62 | ||
72 | ||
78 | ||
96–99 | Why check getParent here instead of just checking areEquivalentValues? | |
105 | When does this happen (aside from results of the parent op)? | |
mlir/test/Transforms/cse.mlir | ||
328–334 | Can you group the CHECK lines in these tests? It's a little hard for me to read given the size of the regions. |
mlir/lib/IR/OperationSupport.cpp | ||
---|---|---|
724–746 | I combined the two to have a single sort function. I kept it separate for clarity, but there is really no need to do that . As to why block arguments ordering matters, that helps in the equivalence comparison in the call back. For example, if you have the following two regions ^bb0(%a0 : f32, %a1: f32) : %0 = arith.addf %a0, %a1 : f32 linalg.yield %0 : f32 and ^bb0(%b0 : f32, %b1: f32) : %1 = arith.addf %b0, %b1 : f32 linalg.yield %1 : f32 These two regions are equivalent. But if you use the pointer values to sort it, when comparing arith.addf operands you can end up with the operands of the arith.addf in the first region being [%a0, %a1] and for the arith.addf in the second region being [%b1, %b0] . Then the equivalence check fails. If you account for the argument position as well. ^bb0(%a0 : f32, %a1: f32) : %0 = arith.addf %a0, %a1 : f32 %1 = arith.mulf %a0, %a1 : f32 %2 = arith.addf %0, %1 : f32 linalg.yield %2 : f32 and ^bb0(%b0 : f32, %b1: f32) : %3 = arith.addf %b0, %b1 : f32 %4 = arith.mulf %b0, %b1 : f32 %5 = arith.addf %3, %4 : f32 linalg.yield %5 : f32 which would still fail cause `%3` and `%4` would be ordered using the `Value`s pointer value. Thats an issue (and really hard to write a test cases for it). Is there a way to completely avoid using pointer value of `Value`? | |
mlir/lib/Transforms/CSE.cpp | ||
96–99 | I think you are right. Dropped the getParent check... | |
105 | You need to treat the values created within a region by equivalent operations as equivalent. For example, ^bb0(%arg0 : f32, %arg1: f32, %arg2 : f32) : %0 = arith.addf %arg0, %arg1 : f32 %1 = arith.mulf %0, %arg2 : f32 linalg.yield %1 : f32 and ^bb0(%arg0 : f32, %arg1: f32, %arg2 : f32) : %2 = arith.addf %arg0, %arg1 : f32 %3 = arith.mulf %2, %arg2 : f32 linalg.yield %3 : f32 %0 and %2 are equivalent and based on that %3 and %1 are equivalent. Then the regions are the same and the ops can be CSEed (if the other conditions like operands are the same attributes are the same, etc. are met). So getParent here checks that this is an op within the region of the two operations being compared, and they are then added to the equivalence set. | |
mlir/test/Transforms/cse.mlir | ||
328–334 | I was just being consistent with the file here. I prefer having the CHECK*s together too. Changed. |
mlir/lib/Transforms/CSE.cpp | ||
---|---|---|
17 | Dont know right now why I needed it (I think it was to get the mlir::hash_value(Value v) definition. Dont need it now. Dropped |
Add an option to enable CSE of single block ops. This seems to have
pretty significant downstream effects. Making this enhancement opt-in
as a ramp to make it default available.
I would prefer the option be true by default. If the pass is creating erroneous code, it's either a bug in the pass (which I assume isn't the case here), or something wrong with downstream users. It should always be legal to CSE an side-effect-free op.
@Mogball I made the CSE for single block ops guarded by a flag. I used it downstream, and it has some non-trivial effects. It is needed for some things, and this is probably an indication that we need a mechanism that allows conditional CSE (on a set of ops, or class of ops, etc.). For now the flag keeps the "new feature" guarded to use as needed. PTAL since this is a change to the patch from when you approved it.
What are the non-trivial effects? I'm quite concerned about guarding things behind a flag.
What are the non-trivial effects? I'm quite concerned about guarding things behind a flag.
Basically CSE increases lifetimes of operations. So actually in the long term we probably need CSE method to be a more controllable pass than it is today. The fact is that this is an issue today already. You could have an mhlo.dot which is a GEMM operation that will get CSE-ed, but when lowered to linalg.matmul it doesnt get CSEed, while the impact of CSE-ing these operations are the same....
It also has impact on computation that is in destination passing style. This is the issue I hit.
%0 = tensor.empty [%d0, %d1] : tensor<?x?xf32> %1 = linalg.fill ins(%cst) outs(%0) : tensor<?x?xf32> %2 = linalg.matmul ins(...) outs(%1) : tensor<?x?xf32> %3 = tensor.empty [%d0, %d1] : tensor<?x?xf32> %4 = linalg.fill ins(%cst) outs(%3) : tensor<?x?xf32> %5 = linalg.matmul ins(...) outs(%4) : tensor<?x?xf32>
Earlier the linalg.fill would not CSE (because actually its an op with a single block). Now it will
%0 = tensor.empty [%d0, %d1] : tensor<?x?xf32> %1 = linalg.fill ins(%cst) outs(%0) : tensor<?x?xf32> %2 = linalg.matmul ins(...) outs(%1) : tensor<?x?xf32> %5 = linalg.matmul ins(...) outs(%1) : tensor<?x?xf32>
While the semantics is correct, this is a more opinionated change than "just CSE". This is effectively adding false sharing of the output when one didnt exist earlier. So bufferization (or some pre-processing before bufferization) will have to break this false dependency. All these could have happened earlier as well with CSE. It is just exposed now because the linalg operations have a single block, and those just happened to not be covered by CSE. So what this change did is unearth an issue that already existed, which needs to be fixed by allowing CSE to be more controllable....
If having a flag is an issue, then please suggest a way forward. This is blocking quite a lot of things for me.....
I really don't see the problem here, this is precisely what I would expect.
The fact that downstream assumptions are not valid anymore should not be a blocker for this to progress.
If having a flag is an issue, then please suggest a way forward. This is blocking quite a lot of things for me.....
It seems like you want downstream-specific behavior to be preserved, this suggests downstream-specific changes (either a custom CSE but ideally don't rely on such assumptions).
I would drop the flag from this PR.
Regardless of the discussion on Linalg vs. MHLO CSE behavior, which is arguably up to the dialects to decide, I think we should just (a) just enable this change without a flag as it looks reasonable to have and (b) consider finer-grain control over general simplification (canonicalization, CSE, constant folding and propagation) in the longer term.
mlir/lib/Transforms/CSE.cpp | ||
---|---|---|
264 | Not done yet? |
mlir/test/Dialect/SparseTensor/codegen_buffer_initialization.mlir | ||
---|---|---|
20 | LGTM, but we need a FIXME for the windows determinism |
Why do you need two SmallVector? Should be easy enough to order OpResults before block arguments inside of the sort function.
Either way, what's the benefit of ordering block arguments differently from results?