This is an archive of the discontinued LLVM Phabricator instance.

[MLIR] Don't sort operand of commutative ops when comparing two ops as there is a correctness issue
ClosedPublic

Authored by tomnatan on Jul 7 2023, 3:00 AM.

Details

Summary

This feature was introduced in D123492.

Doing equivalence on pointers to sort operands of commutative operations is incorrect when checking equivalence of ops in separate regions (where the lhs and rhs operands are marked as equivalent but are not the same value).

It was also discussed in D123492 and D129480 that the correct solution would be to stable sort the operands in canonicalization (based on some numbering in the region maybe), but until that lands, reverting this change will unblock us and other users.

An example of a pass that might not work properly because of this is DuplicateFunctionEliminationPass.

Diff Detail

Event Timeline

tomnatan created this revision.Jul 7 2023, 3:00 AM
Herald added a project: Restricted Project. · View Herald Transcript
tomnatan requested review of this revision.Jul 7 2023, 3:00 AM
tomnatan edited the summary of this revision. (Show Details)Jul 7 2023, 3:03 AM
tomnatan retitled this revision from Revert `D123492` that added the ability to remove commutative operations to [MLIR] Revert `D123492` that added the ability to remove commutative operations.
tomnatan edited the summary of this revision. (Show Details)
tomnatan updated this revision to Diff 538057.Jul 7 2023, 3:08 AM

Fixed comment

tomnatan updated this revision to Diff 538074.Jul 7 2023, 4:15 AM

Fix/remove failing tests

tomnatan updated this revision to Diff 538098.Jul 7 2023, 5:22 AM

fix flang test

Herald added a project: Restricted Project. · View Herald TranscriptJul 7 2023, 5:22 AM

Can we have an example where it fails for you? I'm not saying that we should not revert but it would be good to have an real world example because what was discussed in other patch is downstream AFAIK.

flang/test/Fir/commute.fir
11

I would prefer to keep the test as is and mark it as fail until we have another solution in place,

Can we have an example where it fails for you? I'm not saying that we should not revert but it would be good to have an real world example because what was discussed in other patch is downstream AFAIK.

Only that test or all tests I've removed? how do you mark a test as fails? Just in the name and remove the checks?

clementval added inline comments.Jul 7 2023, 10:13 AM
flang/test/Fir/commute.fir
11

At least this one.

There is a way to tell lit we expect this to fail. You should add // XFAIL: * after the run line.

Can we have an example where it fails for you? I'm not saying that we should not revert but it would be good to have an real world example because what was discussed in other patch is downstream AFAIK.

The example is an internal Google mlir dump, I'm not sure if I can post it publicly as not all dialects are open sourced. But it's essentially two very big but identical regions for which IsRegionEquivalentTo returns false, but returns true with this change.

Can we have an example where it fails for you? I'm not saying that we should not revert but it would be good to have an real world example because what was discussed in other patch is downstream AFAIK.

The example is an internal Google mlir dump, I'm not sure if I can post it publicly as not all dialects are open sourced. But it's essentially two very big but identical regions for which IsRegionEquivalentTo returns false, but returns true with this change.

If you can share smth it would be nice. Even an edited version where we can see where the problem arise.

tomnatan added inline comments.Jul 7 2023, 10:29 AM
flang/test/Fir/commute.fir
11

Doesn't this mean that all tests in the fail are expected to fail whereas here we only except one of them?

clementval added inline comments.Jul 7 2023, 10:30 AM
flang/test/Fir/commute.fir
11

This file is treated as a single test so it would be fine.

Can we have an example where it fails for you? I'm not saying that we should not revert but it would be good to have an real world example because what was discussed in other patch is downstream AFAIK.

The example is an internal Google mlir dump, I'm not sure if I can post it publicly as not all dialects are open sourced. But it's essentially two very big but identical regions for which IsRegionEquivalentTo returns false, but returns true with this change.

I believe the way to expose this in a test upstream is to write a test pass that exposes the issue. That could be a pass that takes every functions in the input module and do pair-wise operation equivalence between them, and returns true/false.

Something I'm unsure about is if the failure would be deterministic: since we use pointer values this seems like it'll depend on the actual allocation...

tomnatan added a comment.EditedJul 7 2023, 10:49 AM

Can we have an example where it fails for you? I'm not saying that we should not revert but it would be good to have an real world example because what was discussed in other patch is downstream AFAIK.

The example is an internal Google mlir dump, I'm not sure if I can post it publicly as not all dialects are open sourced. But it's essentially two very big but identical regions for which IsRegionEquivalentTo returns false, but returns true with this change.

I believe the way to expose this in a test upstream is to write a test pass that exposes the issue. That could be a pass that takes every functions in the input module and do pair-wise operation equivalence between them, and returns true/false.

Something I'm unsure about is if the failure would be deterministic: since we use pointer values this seems like it'll depend on the actual allocation...

Ended up finding a very simple reproducer:

func.func @main(%arg0: tensor<4x8xf32>, %arg1: tensor<4x8xf32>) -> tensor<4x8xf32> {
  %0 = func.call @f1(%arg0, %arg1) : (tensor<4x8xf32>, tensor<4x8xf32>) -> tensor<4x8xf32>
  %1 = func.call @f2(%0, %arg1) : (tensor<4x8xf32>, tensor<4x8xf32>) -> tensor<4x8xf32>
  return %1 : tensor<4x8xf32>
}
func.func @f1(%arg0: tensor<4x8xf32>, %arg1: tensor<4x8xf32>) -> tensor<4x8xf32> {
  %0 = mhlo.add %arg0, %arg1 : tensor<4x8xf32>
  %1 = mhlo.add %0, %arg1 : tensor<4x8xf32>
  %2 = mhlo.add %1, %1 : tensor<4x8xf32>
  %3 = mhlo.add %2, %1 : tensor<4x8xf32>
  return %3 : tensor<4x8xf32>
}
func.func @f2(%arg0: tensor<4x8xf32>, %arg1: tensor<4x8xf32>) -> tensor<4x8xf32> {
  %0 = mhlo.add %arg0, %arg1 : tensor<4x8xf32>
  %1 = mhlo.add %0, %arg1 : tensor<4x8xf32>
  %2 = mhlo.add %1, %1 : tensor<4x8xf32>
  %3 = mhlo.add %2, %1 : tensor<4x8xf32>
  return %3 : tensor<4x8xf32>
}

If you run mlir-opt %s --duplicate-function-elimination it will dedup the functions only with this revert.

I can include this test case in duplicate-function-elimination.mlir

Can we have an example where it fails for you? I'm not saying that we should not revert but it would be good to have an real world example because what was discussed in other patch is downstream AFAIK.

The example is an internal Google mlir dump, I'm not sure if I can post it publicly as not all dialects are open sourced. But it's essentially two very big but identical regions for which IsRegionEquivalentTo returns false, but returns true with this change.

I believe the way to expose this in a test upstream is to write a test pass that exposes the issue. That could be a pass that takes every functions in the input module and do pair-wise operation equivalence between them, and returns true/false.

Something I'm unsure about is if the failure would be deterministic: since we use pointer values this seems like it'll depend on the actual allocation...

Ended up finding a very simple reproducer:

func.func @main(%arg0: tensor<4x8xf32>, %arg1: tensor<4x8xf32>) -> tensor<4x8xf32> {
  %0 = func.call @f1(%arg0, %arg1) : (tensor<4x8xf32>, tensor<4x8xf32>) -> tensor<4x8xf32>
  %1 = func.call @f2(%0, %arg1) : (tensor<4x8xf32>, tensor<4x8xf32>) -> tensor<4x8xf32>
  return %1 : tensor<4x8xf32>
}
func.func @f1(%arg0: tensor<4x8xf32>, %arg1: tensor<4x8xf32>) -> tensor<4x8xf32> {
  %0 = mhlo.add %arg0, %arg1 : tensor<4x8xf32>
  %1 = mhlo.add %0, %arg1 : tensor<4x8xf32>
  %2 = mhlo.add %1, %1 : tensor<4x8xf32>
  %3 = mhlo.add %2, %1 : tensor<4x8xf32>
  return %3 : tensor<4x8xf32>
}
func.func @f2(%arg0: tensor<4x8xf32>, %arg1: tensor<4x8xf32>) -> tensor<4x8xf32> {
  %0 = mhlo.add %arg0, %arg1 : tensor<4x8xf32>
  %1 = mhlo.add %0, %arg1 : tensor<4x8xf32>
  %2 = mhlo.add %1, %1 : tensor<4x8xf32>
  %3 = mhlo.add %2, %1 : tensor<4x8xf32>
  return %3 : tensor<4x8xf32>
}

If you run mlir-opt %s --duplicate-function-elimination it will dedup the functions only with this revert.

I can include this test case in duplicate-function-elimination.mlir

Thanks for the example. It is useful. So if reverted your dedub example works but CSE will not work anymore in example like the flang one. Is that correct?

Can we have an example where it fails for you? I'm not saying that we should not revert but it would be good to have an real world example because what was discussed in other patch is downstream AFAIK.

The example is an internal Google mlir dump, I'm not sure if I can post it publicly as not all dialects are open sourced. But it's essentially two very big but identical regions for which IsRegionEquivalentTo returns false, but returns true with this change.

I believe the way to expose this in a test upstream is to write a test pass that exposes the issue. That could be a pass that takes every functions in the input module and do pair-wise operation equivalence between them, and returns true/false.

Something I'm unsure about is if the failure would be deterministic: since we use pointer values this seems like it'll depend on the actual allocation...

Ended up finding a very simple reproducer:

func.func @main(%arg0: tensor<4x8xf32>, %arg1: tensor<4x8xf32>) -> tensor<4x8xf32> {
  %0 = func.call @f1(%arg0, %arg1) : (tensor<4x8xf32>, tensor<4x8xf32>) -> tensor<4x8xf32>
  %1 = func.call @f2(%0, %arg1) : (tensor<4x8xf32>, tensor<4x8xf32>) -> tensor<4x8xf32>
  return %1 : tensor<4x8xf32>
}
func.func @f1(%arg0: tensor<4x8xf32>, %arg1: tensor<4x8xf32>) -> tensor<4x8xf32> {
  %0 = mhlo.add %arg0, %arg1 : tensor<4x8xf32>
  %1 = mhlo.add %0, %arg1 : tensor<4x8xf32>
  %2 = mhlo.add %1, %1 : tensor<4x8xf32>
  %3 = mhlo.add %2, %1 : tensor<4x8xf32>
  return %3 : tensor<4x8xf32>
}
func.func @f2(%arg0: tensor<4x8xf32>, %arg1: tensor<4x8xf32>) -> tensor<4x8xf32> {
  %0 = mhlo.add %arg0, %arg1 : tensor<4x8xf32>
  %1 = mhlo.add %0, %arg1 : tensor<4x8xf32>
  %2 = mhlo.add %1, %1 : tensor<4x8xf32>
  %3 = mhlo.add %2, %1 : tensor<4x8xf32>
  return %3 : tensor<4x8xf32>
}

If you run mlir-opt %s --duplicate-function-elimination it will dedup the functions only with this revert.

I can include this test case in duplicate-function-elimination.mlir

Thanks for the example. It is useful. So if reverted your dedub example works but CSE will not work anymore in example like the flang one. Is that correct?

Yes exactly, but keeping it as is means the isRegionEquivalentTo and IsEqivalentTo can return incorrect results for regions/ops that are actually identical (not just equivalent).

Can we have an example where it fails for you? I'm not saying that we should not revert but it would be good to have an real world example because what was discussed in other patch is downstream AFAIK.

The example is an internal Google mlir dump, I'm not sure if I can post it publicly as not all dialects are open sourced. But it's essentially two very big but identical regions for which IsRegionEquivalentTo returns false, but returns true with this change.

I believe the way to expose this in a test upstream is to write a test pass that exposes the issue. That could be a pass that takes every functions in the input module and do pair-wise operation equivalence between them, and returns true/false.

Something I'm unsure about is if the failure would be deterministic: since we use pointer values this seems like it'll depend on the actual allocation...

Ended up finding a very simple reproducer:

func.func @main(%arg0: tensor<4x8xf32>, %arg1: tensor<4x8xf32>) -> tensor<4x8xf32> {
  %0 = func.call @f1(%arg0, %arg1) : (tensor<4x8xf32>, tensor<4x8xf32>) -> tensor<4x8xf32>
  %1 = func.call @f2(%0, %arg1) : (tensor<4x8xf32>, tensor<4x8xf32>) -> tensor<4x8xf32>
  return %1 : tensor<4x8xf32>
}
func.func @f1(%arg0: tensor<4x8xf32>, %arg1: tensor<4x8xf32>) -> tensor<4x8xf32> {
  %0 = mhlo.add %arg0, %arg1 : tensor<4x8xf32>
  %1 = mhlo.add %0, %arg1 : tensor<4x8xf32>
  %2 = mhlo.add %1, %1 : tensor<4x8xf32>
  %3 = mhlo.add %2, %1 : tensor<4x8xf32>
  return %3 : tensor<4x8xf32>
}
func.func @f2(%arg0: tensor<4x8xf32>, %arg1: tensor<4x8xf32>) -> tensor<4x8xf32> {
  %0 = mhlo.add %arg0, %arg1 : tensor<4x8xf32>
  %1 = mhlo.add %0, %arg1 : tensor<4x8xf32>
  %2 = mhlo.add %1, %1 : tensor<4x8xf32>
  %3 = mhlo.add %2, %1 : tensor<4x8xf32>
  return %3 : tensor<4x8xf32>
}

If you run mlir-opt %s --duplicate-function-elimination it will dedup the functions only with this revert.

I can include this test case in duplicate-function-elimination.mlir

Thanks for the example. It is useful. So if reverted your dedub example works but CSE will not work anymore in example like the flang one. Is that correct?

Yes exactly, but keeping it as is means the isRegionEquivalentTo and IsEqivalentTo can return incorrect results for regions/ops that are actually identical (not just equivalent).

I think there was also a suggestion to have different algo for CSE and isRegionEquivalentTo. But that would require work anyway.

tomnatan updated this revision to Diff 538566.Jul 10 2023, 3:22 AM

change fir test

Is there anything else blocking us from submitting this CL? I think we should first fix the correctness issue and then move the sorting of operands to the beginning of CSE as suggested in D129480.

flang/test/Fir/commute.fir
11

I changed the test to have the add op with reversed operands, wouldn't it be better than just having XFAIL? i.e. this test will start failing if we introduce a proper fix for commutative ops, and to fix it we just need to remove the extra add in the result.

clementval added inline comments.Jul 10 2023, 8:37 AM
flang/test/Fir/commute.fir
11

That's what I was suggesting at first. Keep the test as it was and mark it XFAIL.

tomnatan added inline comments.Jul 10 2023, 9:14 AM
flang/test/Fir/commute.fir
11

Not sure I follow why we need to mark it as XFAIL as it currently passes because I added the additional add op that isn't csed and when we support commutative sorting it will fail because of the additional add op. Does that makes sense?

clementval added inline comments.Jul 10 2023, 9:26 AM
flang/test/Fir/commute.fir
11

My suggestion is to not alter this test as we want it to be like that when CSE works for commutative op. Marking it as XFAIL with no other modification is like saying this is a TODO.

tomnatan added inline comments.Jul 10 2023, 9:29 AM
flang/test/Fir/commute.fir
11

But does this test actually need to depend on the TODO to support this? What's the harm of updating is as I did and later having to revert back when and if commutative is supported properly? And also what about @f2 here? We will be hiding future failures it.

clementval added inline comments.Jul 10 2023, 9:33 AM
flang/test/Fir/commute.fir
11

What's the point to have a test for commutative op CSE that doesn't work as intended. I would prefer to mark it as XFAIL.

tomnatan updated this revision to Diff 538714.Jul 10 2023, 10:10 AM

revert change in flang test and add XFILE instead

tomnatan marked an inline comment as done.Jul 10 2023, 10:11 AM
tomnatan added inline comments.
flang/test/Fir/commute.fir
11

I see your point, done. Also added https://github.com/llvm/llvm-project/issues/63784 for tracking.

tomnatan marked an inline comment as done.Jul 12 2023, 3:59 AM

Is there anything else blocking us from merging this change? We continue to discuss a proper solution separately.

mehdi_amini accepted this revision.Jul 12 2023, 11:30 AM
This revision is now accepted and ready to land.Jul 12 2023, 11:30 AM

Is there anything else blocking us from merging this change? We continue to discuss a proper solution separately.

@tomnatan Do you have any plan on working to add the proper solution?

Is there anything else blocking us from merging this change? We continue to discuss a proper solution separately.

@tomnatan Do you have any plan on working to add the proper solution?

I'm afraid I can't commit to taking this on as I won't have time to do this in the near future. But I think we should first fix the correctness issue and then prioritize a solution.

@clementval can I go ahead and merge the change?

@clementval can I go ahead and merge the change?

Can you change the title of this commit before landing it. It's not a pure revert of D123492.

tomnatan retitled this revision from [MLIR] Revert `D123492` that added the ability to remove commutative operations to [MLIR] Don't sort operand of commutative ops when comparing two ops as there is a correctness issue.Jul 14 2023, 3:28 PM
tomnatan edited the summary of this revision. (Show Details)

@clementval can I go ahead and merge the change?

Can you change the title of this commit before landing it. It's not a pure revert of D123492.

Done.

jpienaar accepted this revision.Jul 14 2023, 3:32 PM